Skip to content

CI: Cache image on main branch#7

Closed
saturley-hall wants to merge 2 commits into
mainfrom
harrison/main_branch_cache
Closed

CI: Cache image on main branch#7
saturley-hall wants to merge 2 commits into
mainfrom
harrison/main_branch_cache

Conversation

@saturley-hall

Copy link
Copy Markdown
Member

No description provided.

@github-actions

github-actions Bot commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

Test Results

 2 files   2 suites   26s ⏱️
71 tests 71 ✅ 0 💤 0 ❌
89 runs  88 ✅ 1 💤 0 ❌

Results for commit e688847.

@saturley-hall

Copy link
Copy Markdown
Member Author

Closing to reopen

ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.

This commit lands the CDI-mode-safe rewrite:

  - Bystander now claims `nvidia.com/gpu: 1` from the device plugin
    (both `requests` and `limits` per K8s extended-resource rules).
    CDI-mode-agnostic — the device plugin owns the actual GPU wiring
    via its own machinery.

  - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
    mirrors the inline Python manifest exactly so YAML-debugging and
    test-debugging agree (the README explicitly calls this out).

  - New helper `_per_gpu_compute_pids()` snapshots
    `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
    agent pod (which has hostPID + NVML access) and returns
    `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
    PID on some index after the bystander reaches Running — that
    index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
    pin needed.

  - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
    `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
    the index-agnostic contract. Blast-radius / recovery asserts key
    off the discovered `assigned_gpu`; the loop over the other GPUs
    iterates `range(8)` and skips that index instead of hard-coding
    `range(1, 8)`.

  - Class-level `@pytest.mark.skip` decorator removed. The supporting
    `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
    in code nor as a runtime branch, since neither path is needed
    once the CDI-safe flow works.

  - `bystander_uuid` class fixture removed (no longer needed — the
    UUID/index discovery is post-creation, not pre).

`examples/multi-dgd-live-test/README.md` updated:
  - Finding #7 marked **Status: fixed** with cross-reference to this
    commit and an explicit description of the new (a)/(b)/(c) flow.
  - Expected-outcome counts updated: Path B now expects 5 PASSED /
    2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
    from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
    next live run.
  - Live-run record row updated to note the rewrite and expected
    PASS outcome.

Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.

Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.

This commit lands the CDI-mode-safe rewrite:

  - Bystander now claims `nvidia.com/gpu: 1` from the device plugin
    (both `requests` and `limits` per K8s extended-resource rules).
    CDI-mode-agnostic — the device plugin owns the actual GPU wiring
    via its own machinery.

  - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
    mirrors the inline Python manifest exactly so YAML-debugging and
    test-debugging agree (the README explicitly calls this out).

  - New helper `_per_gpu_compute_pids()` snapshots
    `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
    agent pod (which has hostPID + NVML access) and returns
    `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
    PID on some index after the bystander reaches Running — that
    index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
    pin needed.

  - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
    `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
    the index-agnostic contract. Blast-radius / recovery asserts key
    off the discovered `assigned_gpu`; the loop over the other GPUs
    iterates `range(8)` and skips that index instead of hard-coding
    `range(1, 8)`.

  - Class-level `@pytest.mark.skip` decorator removed. The supporting
    `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
    in code nor as a runtime branch, since neither path is needed
    once the CDI-safe flow works.

  - `bystander_uuid` class fixture removed (no longer needed — the
    UUID/index discovery is post-creation, not pre).

`examples/multi-dgd-live-test/README.md` updated:
  - Finding #7 marked **Status: fixed** with cross-reference to this
    commit and an explicit description of the new (a)/(b)/(c) flow.
  - Expected-outcome counts updated: Path B now expects 5 PASSED /
    2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
    from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
    next live run.
  - Live-run record row updated to note the rewrite and expected
    PASS outcome.

Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.

Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
ryanolson added a commit that referenced this pull request May 26, 2026
Two regression tests that pin CONTRACT.md's peer-failure-injection
contract: LifecycleEvent::Failed mid-output (P0 #7 on prefill, P0 #9
on decode) must route through cleanup_failed_request EXACTLY ONCE
even when a concurrent path (e.g., a spawn-catch failure-sink call)
fires for the same request_id.

P0 #7 — cd_prefill_frame_error_mid_output_triggers_cleanup
  - Setup: post-USAA prefill state, observer tracking outputs.
  - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. })
    PLUS a concurrent coord.cleanup_failed_request(rid, ..) call
    (mirroring the spawn-catch / direct-failure-sink race).
  - Assert: mark_failed_onboarding called EXACTLY ONCE with the
    full G1 window vLLM allocated; coord.states[rid] removed;
    mark_onboarding_complete NEVER called.
  - Probe verified: with the prefill CdRequest.cleanup_claimed CAS
    guard temporarily disabled, the test fails with 'got 2 calls'.

P0 #9 — cd_decode_peer_failure_mid_output_cleans_up
  - Setup: post-USAA decode state, local-kick spawned but unresolved
    (so unfilled_g1_block_ids covers both local-match + remote slices).
  - Trigger: session.inject_lifecycle(LifecycleEvent::Failed { .. })
    PLUS a concurrent CdFailureSink::on_session_failure call.
  - Assert: mark_failed_onboarding called EXACTLY ONCE with the
    full external G1 slice; inflight_budget fully released;
    wrapper cd_request_state evicted; coord.states evicted;
    mark_onboarding_complete NEVER called.
  - Probe verified: with the decode CdRequestState.cleanup_claimed
    CAS guard temporarily disabled, the test fails with 'got 2 calls'.

Both tests use kvbm_engine::p2p::session::LifecycleEvent (already
public) as the deterministic injection knob — no MockSession
extension needed. The 'in-flight session calls return Err' branch of
the original P0 #7 spec is not pinned here because MockSession's
commit / make_available always return Ok regardless of lifecycle
state; that would require a separate MockSession enhancement and
exercises velo-wire semantics rather than connector semantics.
ryanolson added a commit that referenced this pull request May 26, 2026
Codex stop-time review caught: the previous shape of these tests
fired LifecycleEvent::Failed AND a concurrent cleanup call in
parallel, then asserted exactly-once mark_failed_onboarding. That
assertion holds even if the lifecycle path is dead — the direct
call alone produces one cleanup. The test gated the CAS guard but
NOT the wire from session.lifecycle() → watcher → cleanup.

Restructure both tests into two sequential stages:

  Stage 1 — lifecycle path is the ONLY trigger.
  Pre-condition: no mark_failed_onboarding calls yet. Inject
  LifecycleEvent::Failed on the session. Wait. The resulting
  cleanup can only have come from the watcher → cleanup chain.

  Stage 2 — duplicate cleanup is a no-op.
  Fire a second cleanup_failed_request (or on_session_failure)
  after Stage 1 already claimed the CAS. The duplicate must NOT
  re-emit mark_failed_onboarding.

Probe verification on the prefill side: short-circuiting
Self::lifecycle_failure_reason to always return None breaks
Stage 1 — the watcher takes the cooperative branch instead, no
cleanup is dispatched, and Stage 1 times out waiting for
mark_failed_onboarding. The lifecycle wire is gated.

Probe on the decode side surfaced an asymmetry worth pinning in
the test comment: even with lifecycle_failure_reason short-
circuited, the decode test passes because the watcher's
cooperative branch calls state.cancel.cancel(), which propagates
through run_remote_pipeline's tokio::select! arm → spawn-catch
→ wrapper.cleanup_failed_request. So the decode test gates the
OUTCOME contract (peer Failed → cleanup) but does not exclusively
gate the lifecycle-failure-reason route. The shared
Self::lifecycle_failure_reason function is gated by the prefill
test; documenting the dual-path nature in the decode test comment.

Both tests still pass (full suite 242 lib + 15 decode + 29 prefill
+ others green). The CAS-guard discrimination probes from the
previous commit still hold: disabling either CAS makes the
respective Stage 2 / mark_failed_onboarding count assertion fail.
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
/ai-dynamo#8/ai-dynamo#15/ai-dynamo#16)

Test-only additions for the seams the review flagged as untested.

ai-dynamo#7 _project_scale_to with a real apply outcome (4 cases):
  both components changed → full ScalingDecision; both equal current →
  None (PSM-equivalent no-change); single-component proposal → other count
  stays None; non-apply execute_action → None. Previously every adapter.tick
  test hit only the None/empty path, so a regression in the projection /
  no-change detection would have shipped silently.

ai-dynamo#8 _tick_input_to_context + FPM encoding:
  build a TickInput with traffic (incl kv_hit_rate), worker counts (incl
  scaling-in-progress flags), and a real ForwardPassMetrics; assert the
  PipelineContext.observations mapping and that the FPM bytes decode back
  (key format "<worker_id>/<dp_rank>", canonical encoder). This is the
  ingress glue where the add_observations P1 + the projection live.

ai-dynamo#15 registry mutation during an in-flight tick:
  suspend a PROPOSE plugin mid-gather (asyncio.Event), register a new
  plugin while suspended, release, and assert the late plugin did NOT join
  the in-flight stage (pre-tick snapshot) and the tick completed cleanly —
  then a fresh tick picks it up. Exercises the no-locks invariant that
  scheduler.py/server.py document but no test covered.

ai-dynamo#16 test_tick_diagnostics_extended scope note:
  clarify in the module docstring that plugin_overrides / reconcile_reasons
  / held_over_plugins have no production populator in this PR; these tests
  lock the dataclass contract (defaults / no shared-mutable aliasing), not
  live behavior.

835 planner tests pass (+6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.

This commit lands the CDI-mode-safe rewrite:

  - Bystander now claims `nvidia.com/gpu: 1` from the device plugin
    (both `requests` and `limits` per K8s extended-resource rules).
    CDI-mode-agnostic — the device plugin owns the actual GPU wiring
    via its own machinery.

  - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
    mirrors the inline Python manifest exactly so YAML-debugging and
    test-debugging agree (the README explicitly calls this out).

  - New helper `_per_gpu_compute_pids()` snapshots
    `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
    agent pod (which has hostPID + NVML access) and returns
    `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
    PID on some index after the bystander reaches Running — that
    index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
    pin needed.

  - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
    `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
    the index-agnostic contract. Blast-radius / recovery asserts key
    off the discovered `assigned_gpu`; the loop over the other GPUs
    iterates `range(8)` and skips that index instead of hard-coding
    `range(1, 8)`.

  - Class-level `@pytest.mark.skip` decorator removed. The supporting
    `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
    in code nor as a runtime branch, since neither path is needed
    once the CDI-safe flow works.

  - `bystander_uuid` class fixture removed (no longer needed — the
    UUID/index discovery is post-creation, not pre).

`examples/multi-dgd-live-test/README.md` updated:
  - Finding #7 marked **Status: fixed** with cross-reference to this
    commit and an explicit description of the new (a)/(b)/(c) flow.
  - Expected-outcome counts updated: Path B now expects 5 PASSED /
    2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
    from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
    next live run.
  - Live-run record row updated to note the rewrite and expected
    PASS outcome.

Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.

Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Cross-DGD complement to test_actuation_knobs_live.py (the single-DGD live
test that already ships with this PR). Validates invariants that
single-DGD tests structurally cannot reach:

  - A2  Three planners patch DISJOINT pod sets in one namespace
  - A3  AggPlanner-C uses only the DECODE branch (no PREFILL calls)
  - B2  Power Agent applies 8 distinct caps on one tick
  - B4  Happy-path zero-counter invariant
  - B5  Multi-pod conflict + safe-default behaviour (skipped pending
        Finding #7 — CDI-safe bystander rewrite tracked separately)
  - 8.5.6 / 8.6.2  NVML / DCGM cap-application proven by reading the
                   live driver state (not via mocks)

Layout introduced by this commit:

  components/src/dynamo/planner/tests/integration/test_multi_dgd_live.py
      ~1000 LoC pytest module. Opt-in via RUN_MULTI_DGD_LIVE=1 +
      TEST_NODE=<8-GPU node> env vars; auto-detects Path A (production
      shape: 3 real DGDs + power-aware planner) vs Path B (5 stub-worker
      pods with hardcoded annotations) and skips Path-A-only tests
      cleanly when running Path B. DCGM tests gated by
      DCGM_HOSTENGINE_AVAILABLE=1.

  examples/multi-dgd-live-test/
      Manifests + helm values overrides + teardown script + README
      walk-through that an operator applies before running the test:
        00-namespace.yaml                 — namespace + pull secret
        10-power-agent-values-nvml.yaml   — Power Agent NVML helm values
        11-power-agent-values-dcgm.yaml   — Power Agent DCGM helm values
        20-nvidia-dcgm-standalone-ds.yaml — DCGM 4.5.1 hostengine DS
        30/31/32-vllm-dgd-{A,B,C}.yaml    — three DGDs (Path A)
        40-conflict-bystander-pod.yaml    — multi-pod conflict fixture
        50-stub-workers.yaml              — Path-B stub workers
        99-teardown.sh                    — idempotent cleanup w/ cap-
                                            restore verification
        README.md                         — full reproduction playbook
                                            including a 2026-05-21 live
                                            run record + 11 findings

  docs/design-docs/multi-tenant-test.md
      Design doc (§8 — multi-DGD test plan) that establishes the
      ground-truth topology, the planner / agent contract under test,
      and the rationale for the artifact split above.

  pyproject.toml
      Registers two new test markers consumed by the live test's
      `pytestmark`:
        multi_dgd_live  — scenario selector for this specific file
        power_agent     — shared with components/power_agent/tests/

Module-level gating fails fast on misconfiguration (missing TEST_NODE,
missing kubeconfig) with actionable error messages, and skips silently
when RUN_MULTI_DGD_LIVE is unset — so this file is a no-op under the
default unit-test invocation. Verified: pytest --collect-only is a
clean no-op without the env vars, and full-import works under
RUN_MULTI_DGD_LIVE=1 (collection fails cleanly on absent k8s creds, not
ImportError).

Refs: docs/design-docs/multi-tenant-test.md §8, PR #9683 review.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
`TestMultiPodConflict` previously created a bystander pod with
`NVIDIA_VISIBLE_DEVICES=<gpu-uuid>` to pin it to a specific GPU. AKS
clusters running the NVIDIA Container Toolkit in CDI device mode (the
dpp-dev-env default, and the direction NVIDIA is pushing for newer
operator releases) reject that with
`unresolvable CDI devices management.nvidia.com/gpu=*` and refuse to
admit the pod — so the test class was carrying a
`@pytest.mark.skip(_CDI_CONFLICT_SKIP_REASON)` decorator that disabled
the whole §8.5.7 contract on every run.

This commit lands the CDI-mode-safe rewrite:

  - Bystander now claims `nvidia.com/gpu: 1` from the device plugin
    (both `requests` and `limits` per K8s extended-resource rules).
    CDI-mode-agnostic — the device plugin owns the actual GPU wiring
    via its own machinery.

  - `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`
    mirrors the inline Python manifest exactly so YAML-debugging and
    test-debugging agree (the README explicitly calls this out).

  - New helper `_per_gpu_compute_pids()` snapshots
    `nvmlDeviceGetComputeRunningProcesses` across all 8 GPUs from the
    agent pod (which has hostPID + NVML access) and returns
    `{gpu_idx: {pid, ...}}`. The test polls this until it sees a new
    PID on some index after the bystander reaches Running — that
    index IS the GPU the device plugin assigned, no NVIDIA_VISIBLE_DEVICES
    pin needed.

  - Test renamed `test_conflict_triggers_safe_default_on_gpu0_only` →
    `test_conflict_triggers_safe_default_on_assigned_gpu` to reflect
    the index-agnostic contract. Blast-radius / recovery asserts key
    off the discovered `assigned_gpu`; the loop over the other GPUs
    iterates `range(8)` and skips that index instead of hard-coding
    `range(1, 8)`.

  - Class-level `@pytest.mark.skip` decorator removed. The supporting
    `_CDI_CONFLICT_SKIP_REASON` constant is dropped too — kept neither
    in code nor as a runtime branch, since neither path is needed
    once the CDI-safe flow works.

  - `bystander_uuid` class fixture removed (no longer needed — the
    UUID/index discovery is post-creation, not pre).

`examples/multi-dgd-live-test/README.md` updated:
  - Finding #7 marked **Status: fixed** with cross-reference to this
    commit and an explicit description of the new (a)/(b)/(c) flow.
  - Expected-outcome counts updated: Path B now expects 5 PASSED /
    2 SKIPPED (was 4 PASSED / 3 SKIPPED — multi-pod conflict moves
    from SKIP to PASS); DCGM pass expects 7 PASSED / 2 SKIPPED on the
    next live run.
  - Live-run record row updated to note the rewrite and expected
    PASS outcome.

Verified locally: test module compiles + collects cleanly. The
end-to-end CDI-vs-non-CDI assignment behaviour is by construction the
device plugin's responsibility, so the failure mode of this rewrite
is bounded to "device plugin returns no GPU within 90 s" — handled
with a clear pytest.fail message that includes a before/after PID
snapshot for debugging.

Refs: PR #9683 review (Finding #7 in
examples/multi-dgd-live-test/README.md).

Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
…dd SPDX header

Two follow-up fixes to commit 552be92 caught in PR #9683 review.

1. Bystander cannot schedule. The previous CDI-safe rewrite swapped
   `NVIDIA_VISIBLE_DEVICES=<uuid>` for a `nvidia.com/gpu: 1` device-
   plugin claim, expecting the plugin to allocate a free GPU. But
   the 5 stub workers in `50-stub-workers.yaml` already claim every
   `nvidia.com/gpu` slot on the test node (1+1+2+2+2 = 8 of 8), so
   the bystander's claim sits Pending forever and the conflict never
   materialises.

   Switch to the same pattern the Power Agent itself uses
   (`deploy/helm/charts/power-agent/values.yaml`):
   `NVIDIA_VISIBLE_DEVICES: "all"` exposes every GPU through the
   toolkit without consuming a device-plugin claim. The bystander
   pins itself to `cuda:0` (deterministically the first visible
   device — same NVML index 0 as the agent, since both have
   `NVIDIA_VISIBLE_DEVICES=all`). The existing PID-based discovery
   loop in the test still verifies which physical GPU the context
   landed on rather than trusting the mapping, so the test stays
   robust under future toolkit / CDI changes.

   Touches the inline `bystander_manifest` in
   `test_multi_dgd_live.py::TestMultiPodConflict` and the YAML
   companion `examples/multi-dgd-live-test/40-conflict-bystander-pod.yaml`,
   plus the README's Finding #7 status to document both
   non-options (the original `NVIDIA_VISIBLE_DEVICES=<uuid>` CDI
   rejection AND the `nvidia.com/gpu: 1` over-subscription) and
   the surviving fix-shape.

2. Copyright check failed. `40-conflict-bystander-pod.yaml` was
   imported from `.aic_sandbox/` in commit 004df25 without the
   standard two-line SPDX header that every other YAML in
   `examples/multi-dgd-live-test/` carries (00-namespace through
   50-stub-workers — verified via head-of-file scan). Add the
   header.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant