Skip to content

ci: update trigger token#2

Closed
nv-anants wants to merge 2 commits into
mainfrom
anants/update-ci
Closed

ci: update trigger token#2
nv-anants wants to merge 2 commits into
mainfrom
anants/update-ci

Conversation

@nv-anants

Copy link
Copy Markdown
Member

What does the PR do?

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@nv-anants nv-anants closed this Mar 4, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
athreesh added a commit that referenced this pull request Aug 18, 2025
elyasmnvidian added a commit that referenced this pull request Sep 23, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 30, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
keivenchang added a commit that referenced this pull request Feb 5, 2026
Changes:
- Move wait_until_events_processed to BEFORE sending requests #2 and #3
  (was AFTER each request, causing stale cache view during routing)
- Add DIS-1406/analyze_test_errors.py script to categorize A1/A2/A3/A4 errors
- Remove unnecessary sleep delays before first request attempt
- Reorganize reproduction materials into DIS-1406/ directory

Fix13 improves from Fix12 (38/50 → 39/50 passed, 24% → 22% A1 errors)
but still shows ~22% A1 failures due to non-deterministic dp_rank selection.
tedzhouhk added a commit that referenced this pull request Apr 18, 2026
Adds TestQwen235BugReproCase with the exact two picks from the original
triage report:

* Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4
* Decode: tp=1, dp=8, moe_tp=2, moe_ep=4

Each test pins down a specific symptom that the old profiler path
produced:

* test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs)
  and Bug #2 (using .tp_size property) would break the identity for
  this pick. Verifies all five kwargs and the MoE identity.
* test_decode_pick_satisfies_aic_identity — same for the decode pick,
  which was the exact shape that triggered Bug #3 (missing
  attention_dp_size).
* test_tp_size_property_differs_from_aic_tp_size — direct regression
  guard: documents that the property returns 1 here while AIC's tp_size
  is 4. Anyone who wires .tp_size back into AIC kwargs will break this.
* test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both
  sweeps and asserts every AIC call gets complete kwargs. This is the
  direct reproducer of the originally-reported failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
ryanolson added a commit that referenced this pull request May 20, 2026
…nly; extract executor helpers

#5: PreparedTransferPlan becomes a struct (Transform fields only); the
Direct variant + build_direct + annotated_layouts + the projection-memo
cache it backed are all removed. lookup_prepared_plan returns
Option<Arc<...>> with None for same-layout direct pairs; plan_and_lower
projects AnnotatedLayouts inline on that branch. The microsecond
projection-memo against millisecond-scale DMA/RDMA was never
measurable, and the Direct cache entry per axis_slice pattern was
churn for no gain.

#2 surgical: TransferManager::execute_transfer +
::execute_transfer_selection now share resolve_and_dissolve (handle
resolve + options dissolve + builder field threading) and
dispatch_resolved (executor call + metric wrap). Each public method
shrinks from ~80 lines to ~10. execute_planner_cuda_transfer +
execute_planner_nixl_transfer share a new planner_prelude helper
covering heterogeneous-reject + benchmark-lookup + prepared-lookup +
plan_and_lower — only the family validator + policy + post-outcome
dispatch differ per entry.

Net ~-220 lines. Verified: 743/743 lib tests pass; incompat-smoke.sh
6/6 PASS; cargo fmt + clippy clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
ryanolson added a commit that referenced this pull request May 26, 2026
Decode-side audit (mapped against the prefill-side iteration 1-5
chain in 31fe452) found one race shared with prefill: the
wrapper's cleanup_failed_request has five connector-spawned entry
points that can race for the same request_id — lifecycle watcher's
failure-sink route, commit_usaa1's local-kick / remote-pipeline /
session-missing spawns, and the enqueue spawn's mark_failed.
Without a CAS guard, two paths can both observe non-empty
unfilled_ids and both call worker_hook.mark_failed_onboarding
before either's release_request removes the cd_request_state
entry, double-notifying vLLM for the same failed G1 window.

Adds cleanup_claimed: AtomicBool on CdRequestState, symmetric
with the field on CdRequest, and CAS-gates the wrapper's
cleanup_failed_request. commit_usaa1's USAA-1 state rebuild
threads the existing flag forward so an already-claimed cleanup
keeps its winner across the entry rebuild.

Reproducer-first regression test
(cd_decode_concurrent_cleanup_marks_failed_once) spawns 8
concurrent on_session_failure calls; verified to fail pre-fix
('got 2' calls) and pass post-fix ('got 1') on a 4-worker
multi-thread runtime.

Smoke confirms the audit's secondary finding (cooperative cancel
routing through cleanup_failed_request) is quiescent: R1+R2 both
clean, zero cleanup_failed_request / decode_failure_sink_invoked
events for the successful request IDs at
/tmp/kvbm-experiments/20260526-011649-two-request.

Decode-side audit findings iteration #1: covered.
Iterations #2-5 do not apply (different state geometry, no
observer-drain shape, no buffer-then-attach pattern). See
lib/kvbm-connector/CONTRACT.md "Concurrency" section for the
full map.
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 2, 2026
…SM parity

Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when
``last_call_at == -math.inf`` (the "never called" sentinel), so a
plugin with ``execution_interval_seconds`` > 0 still fired on the
first pipeline tick — bypassing the interval entirely on the first
call.

This breaks PSM cadence parity once PR #2 introduces builtin plugins
like ``BuiltinThroughputPropose`` (``execution_interval_seconds=
throughput_adjustment_interval_seconds=180s``).  PSM's
``initial_tick(start_s)`` schedules the first throughput fire at
``start_s + 180s``, not at ``start_s``.  The buggy first-fire would:

1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick),
   bumping ``last_call_at`` to 5.
2. At T=180 (PSM's expected first throughput fire) the plugin would
   not be due: ``180 - 5 = 175 < 180`` → skip.
3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's
   180/360/540 cadence.

Six concrete symptoms cascade from (3): wasted RPCs on load-only
ticks, HOLD_LAST cache that never refreshes on the correct cadence,
observability metrics that record "fired" for self-gating no-ops,
and divergence from PSM's ``run_throughput_scaling`` flag once the
flag is replaced by per-plugin throttle in PR #2.

Fix: anchor first-ever fire on ``registered_at`` instead of
unconditionally firing.  ``_is_due`` becomes:

  if interval <= 0.0: return True   # "every tick"
  anchor = registered_at if last_call_at == -inf else last_call_at
  return (now - anchor) >= interval

Semantics: "fire every N seconds, starting N seconds after
registration."  Matches PSM's ``initial_tick`` semantic and aligns
with most operators' intuition that ``interval=N`` means "every N
seconds" (not "every N seconds plus one free immediate call").

Tests updated:

- ``test_active_set::test_first_tick_triggers_even_with_positive_interval``
  asserted the *old* (buggy) behaviour and has been replaced by
  ``test_first_fire_anchored_on_registration_time`` which asserts
  the new PSM-parity semantic with three time steps (registration,
  half-window, full-window).
- Five other tests that called ``compute_active_set`` at t=0 to get
  a triggered plugin now ``clock.advance(interval)`` first.  Each
  is annotated so the next reader knows why.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 2, 2026
…SM parity

Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when
``last_call_at == -math.inf`` (the "never called" sentinel), so a
plugin with ``execution_interval_seconds`` > 0 still fired on the
first pipeline tick — bypassing the interval entirely on the first
call.

This breaks PSM cadence parity once PR #2 introduces builtin plugins
like ``BuiltinThroughputPropose`` (``execution_interval_seconds=
throughput_adjustment_interval_seconds=180s``).  PSM's
``initial_tick(start_s)`` schedules the first throughput fire at
``start_s + 180s``, not at ``start_s``.  The buggy first-fire would:

1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick),
   bumping ``last_call_at`` to 5.
2. At T=180 (PSM's expected first throughput fire) the plugin would
   not be due: ``180 - 5 = 175 < 180`` → skip.
3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's
   180/360/540 cadence.

Six concrete symptoms cascade from (3): wasted RPCs on load-only
ticks, HOLD_LAST cache that never refreshes on the correct cadence,
observability metrics that record "fired" for self-gating no-ops,
and divergence from PSM's ``run_throughput_scaling`` flag once the
flag is replaced by per-plugin throttle in PR #2.

Fix: anchor first-ever fire on ``registered_at`` instead of
unconditionally firing.  ``_is_due`` becomes:

  if interval <= 0.0: return True   # "every tick"
  anchor = registered_at if last_call_at == -inf else last_call_at
  return (now - anchor) >= interval

Semantics: "fire every N seconds, starting N seconds after
registration."  Matches PSM's ``initial_tick`` semantic and aligns
with most operators' intuition that ``interval=N`` means "every N
seconds" (not "every N seconds plus one free immediate call").

Tests updated:

- ``test_active_set::test_first_tick_triggers_even_with_positive_interval``
  asserted the *old* (buggy) behaviour and has been replaced by
  ``test_first_fire_anchored_on_registration_time`` which asserts
  the new PSM-parity semantic with three time steps (registration,
  half-window, full-window).
- Five other tests that called ``compute_active_set`` at t=0 to get
  a triggered plugin now ``clock.advance(interval)`` first.  Each
  is annotated so the next reader knows why.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
krishung5 added a commit that referenced this pull request Jun 2, 2026
…g launch summary

Two changes to examples/backends/sglang/launch/agg_multimodal_router.sh:

1. Replace the custom echo banner block with print_launch_banner per
   .ai/bash-launch-guidelines.md (matches sibling agg_router.sh:60).
   --no-curl is set because our own wait_ready loop later handles the
   smoke test, and --multimodal flags the script's nature.

2. Build a KV_EVENTS_PORTS array alongside WORKER_PORTS and reference
   both in the summary section, so when the harness sets DYN_SYSTEM_PORT{i}
   the printed URLs match what was actually launched (instead of always
   showing the default formula). Previously CI logs lied about ports
   under dynamic test-port allocation.

Addresses #9561 review: Devin comments #1 + #2, CodeRabbit nitpick on
agg_multimodal_router.sh:65-68.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 3, 2026
…SM parity

Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when
``last_call_at == -math.inf`` (the "never called" sentinel), so a
plugin with ``execution_interval_seconds`` > 0 still fired on the
first pipeline tick — bypassing the interval entirely on the first
call.

This breaks PSM cadence parity once PR #2 introduces builtin plugins
like ``BuiltinThroughputPropose`` (``execution_interval_seconds=
throughput_adjustment_interval_seconds=180s``).  PSM's
``initial_tick(start_s)`` schedules the first throughput fire at
``start_s + 180s``, not at ``start_s``.  The buggy first-fire would:

1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick),
   bumping ``last_call_at`` to 5.
2. At T=180 (PSM's expected first throughput fire) the plugin would
   not be due: ``180 - 5 = 175 < 180`` → skip.
3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's
   180/360/540 cadence.

Six concrete symptoms cascade from (3): wasted RPCs on load-only
ticks, HOLD_LAST cache that never refreshes on the correct cadence,
observability metrics that record "fired" for self-gating no-ops,
and divergence from PSM's ``run_throughput_scaling`` flag once the
flag is replaced by per-plugin throttle in PR #2.

Fix: anchor first-ever fire on ``registered_at`` instead of
unconditionally firing.  ``_is_due`` becomes:

  if interval <= 0.0: return True   # "every tick"
  anchor = registered_at if last_call_at == -inf else last_call_at
  return (now - anchor) >= interval

Semantics: "fire every N seconds, starting N seconds after
registration."  Matches PSM's ``initial_tick`` semantic and aligns
with most operators' intuition that ``interval=N`` means "every N
seconds" (not "every N seconds plus one free immediate call").

Tests updated:

- ``test_active_set::test_first_tick_triggers_even_with_positive_interval``
  asserted the *old* (buggy) behaviour and has been replaced by
  ``test_first_fire_anchored_on_registration_time`` which asserts
  the new PSM-parity semantic with three time steps (registration,
  half-window, full-window).
- Five other tests that called ``compute_active_set`` at t=0 to get
  a triggered plugin now ``clock.advance(interval)`` first.  Each
  is annotated so the next reader knows why.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
…nfig paths

Independent review caught that ``RegisterRequest.requires_produced_fields``
and ``observation_window_seconds`` (the scale_interval cadence contract
fields added in commits 13-18) were only reachable via the gRPC gateway
self-register path.  The two ConfigMap-driven intake paths — static
external plugin list and in-process plugin spec — both used Pydantic
``extra="forbid"`` and silently dropped these fields.  Net result:

- ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]``
  could not be configured via ConfigMap, only via gRPC self-register.
- ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus
  aggregation window for stable averaging) was similarly unreachable.

Five wiring fixes (top-down following the call chain):

1. ``LocalPlannerOrchestrator.register_internal``: added the two new
   kwargs, passed through to ``PluginRegistryServer.register_internal``.
   Until now the orchestrator-level facade silently dropped them.

2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic
   ``Field`` declarations for both new fields, with descriptions
   pointing at the cadence contract.

3. ``register_external_from_config`` (orchestrator.py): threaded both
   fields into the ``RegisterRequest`` construction.  Without this,
   #2 alone wouldn't reach the registry.

4. ``InProcessPluginSpec`` (registry/config.py): added ``needs``
   (also previously missing!) + the two new cadence fields.

5. ``load_in_process_plugins`` (in_process_loader.py): passed
   ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds``
   into ``orchestrator.register_internal``.

Tests:
- ``test_loader_passes_scale_interval_fields_to_registered_plugin``:
  in-process loader passthrough end-to-end (spec → plugin).
- ``test_bootstrap_passes_scale_interval_fields_through``:
  static external bootstrap passthrough end-to-end.

348 plugin tests pass (+2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
…SM parity

Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when
``last_call_at == -math.inf`` (the "never called" sentinel), so a
plugin with ``execution_interval_seconds`` > 0 still fired on the
first pipeline tick — bypassing the interval entirely on the first
call.

This breaks PSM cadence parity once PR #2 introduces builtin plugins
like ``BuiltinThroughputPropose`` (``execution_interval_seconds=
throughput_adjustment_interval_seconds=180s``).  PSM's
``initial_tick(start_s)`` schedules the first throughput fire at
``start_s + 180s``, not at ``start_s``.  The buggy first-fire would:

1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick),
   bumping ``last_call_at`` to 5.
2. At T=180 (PSM's expected first throughput fire) the plugin would
   not be due: ``180 - 5 = 175 < 180`` → skip.
3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's
   180/360/540 cadence.

Six concrete symptoms cascade from (3): wasted RPCs on load-only
ticks, HOLD_LAST cache that never refreshes on the correct cadence,
observability metrics that record "fired" for self-gating no-ops,
and divergence from PSM's ``run_throughput_scaling`` flag once the
flag is replaced by per-plugin throttle in PR #2.

Fix: anchor first-ever fire on ``registered_at`` instead of
unconditionally firing.  ``_is_due`` becomes:

  if interval <= 0.0: return True   # "every tick"
  anchor = registered_at if last_call_at == -inf else last_call_at
  return (now - anchor) >= interval

Semantics: "fire every N seconds, starting N seconds after
registration."  Matches PSM's ``initial_tick`` semantic and aligns
with most operators' intuition that ``interval=N`` means "every N
seconds" (not "every N seconds plus one free immediate call").

Tests updated:

- ``test_active_set::test_first_tick_triggers_even_with_positive_interval``
  asserted the *old* (buggy) behaviour and has been replaced by
  ``test_first_fire_anchored_on_registration_time`` which asserts
  the new PSM-parity semantic with three time steps (registration,
  half-window, full-window).
- Five other tests that called ``compute_active_set`` at t=0 to get
  a triggered plugin now ``clock.advance(interval)`` first.  Each
  is annotated so the next reader knows why.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
…nfig paths

Independent review caught that ``RegisterRequest.requires_produced_fields``
and ``observation_window_seconds`` (the scale_interval cadence contract
fields added in commits 13-18) were only reachable via the gRPC gateway
self-register path.  The two ConfigMap-driven intake paths — static
external plugin list and in-process plugin spec — both used Pydantic
``extra="forbid"`` and silently dropped these fields.  Net result:

- ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]``
  could not be configured via ConfigMap, only via gRPC self-register.
- ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus
  aggregation window for stable averaging) was similarly unreachable.

Five wiring fixes (top-down following the call chain):

1. ``LocalPlannerOrchestrator.register_internal``: added the two new
   kwargs, passed through to ``PluginRegistryServer.register_internal``.
   Until now the orchestrator-level facade silently dropped them.

2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic
   ``Field`` declarations for both new fields, with descriptions
   pointing at the cadence contract.

3. ``register_external_from_config`` (orchestrator.py): threaded both
   fields into the ``RegisterRequest`` construction.  Without this,
   #2 alone wouldn't reach the registry.

4. ``InProcessPluginSpec`` (registry/config.py): added ``needs``
   (also previously missing!) + the two new cadence fields.

5. ``load_in_process_plugins`` (in_process_loader.py): passed
   ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds``
   into ``orchestrator.register_internal``.

Tests:
- ``test_loader_passes_scale_interval_fields_to_registered_plugin``:
  in-process loader passthrough end-to-end (spec → plugin).
- ``test_bootstrap_passes_scale_interval_fields_through``:
  static external bootstrap passthrough end-to-end.

348 plugin tests pass (+2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
nnshah1 added a commit that referenced this pull request Jun 10, 2026
Address graham-code-review feedback on PR #10351:

- Drop the secondary `owners` map; store `Owner = (instance_id,
  lora_slug)` inline with each entry value. One lock, one source of
  truth, no nested-write-lock hazard, no two-map sync risk.
- `register` takes `&Owner` (one clone inside, not per-file).
- Panic on collision: re-registering the same (slug, suffix, filename)
  with a different owner is a programming error (two attaches of the
  same model+suffix in one process would let detach-#1 wipe files
  detach-#2 still needs). Same-owner re-register is fine and just
  updates the path.
- Doc + local var naming aligned on `instance_id` to match
  `local_model.rs`'s existing usage (the value populates
  `DiscoveryInstance::Model.instance_id`).
- Tests: collision panic + same-owner update path coverage.

Signed-off-by: nnshah1 <neelays@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 10, 2026
Resolve conflicts in components/power_agent/power_agent.py:
- _resolve_cap_for_gpu: keep strict multi-pod policy (PR9790 Codex
  finding #2) over base's lenient None-filtering docstring/intent; the
  merged code body enforces the strict policy.
- _reconcile_gpu: keep actuator-routed PID read + apply_cap (the DCGM
  dual-actuator path that is the purpose of this PR) over base's inline
  pynvml; folded base's persistent-cap rationale into the docstring.

release.yml and power-agent/README.md auto-merged.
Validated: components/power_agent/tests 206 passed.

Signed-off-by: Kai Ma <kaim@nvidia.com>
BenHamm added a commit that referenced this pull request Jun 14, 2026
…dya review #2)

Replace the monolithic recipes.yaml / benchmarks.yaml with one file per
entry + an ordering index + a JSON schema + a validator, to cut merge
conflicts and give a cleaner per-entry ownership/validation path.

  docs/recipes/_catalog/
    index.yaml          # ordering + inclusion (10 active + 3 deferred)
    schema.json         # JSON Schema (draft 2020-12), one recipe object
    recipes/<id>.yaml   # one object per file, verbatim
    validate.py         # validates BOTH catalogs
  docs/benchmarks/_catalog/{index.yaml,schema.json,benchmarks/<id>.yaml}

validate.py checks index<->file correspondence, id/filename match, no
dups, schema conformance, page/asset path resolution, and cross-catalog
referential integrity; exits non-zero on failure. Stdlib-only with
graceful degradation (prefers pyyaml+jsonschema, falls back to a builtin
parser + required-keys check). All entries preserved verbatim
(round-trip diff: 0 mismatches). Wiring the validator into CI is a
tracked follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant