ci: update trigger token by nv-anants · Pull Request #2 · ai-dynamo/dynamo

nv-anants · 2025-03-04T00:48:29Z

What does the PR do?

Checklist

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

Related PRs:

Where should the reviewer start?

Test plan:

CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

Changes: - Move wait_until_events_processed to BEFORE sending requests #2 and #3 (was AFTER each request, causing stale cache view during routing) - Add DIS-1406/analyze_test_errors.py script to categorize A1/A2/A3/A4 errors - Remove unnecessary sleep delays before first request attempt - Reorganize reproduction materials into DIS-1406/ directory Fix13 improves from Fix12 (38/50 → 39/50 passed, 24% → 22% A1 errors) but still shows ~22% A1 failures due to non-deterministic dp_rank selection.

Adds TestQwen235BugReproCase with the exact two picks from the original triage report: * Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4 * Decode: tp=1, dp=8, moe_tp=2, moe_ep=4 Each test pins down a specific symptom that the old profiler path produced: * test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs) and Bug #2 (using .tp_size property) would break the identity for this pick. Verifies all five kwargs and the MoE identity. * test_decode_pick_satisfies_aic_identity — same for the decode pick, which was the exact shape that triggered Bug #3 (missing attention_dp_size). * test_tp_size_property_differs_from_aic_tp_size — direct regression guard: documents that the property returns 1 here while AIC's tp_size is 4. Anyone who wires .tp_size back into AIC kwargs will break this. * test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both sweeps and asserts every AIC call gets complete kwargs. This is the direct reproducer of the originally-reported failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>

…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>

…cycle Address senior-architect review of the metrics infrastructure: #1 (two parallel Prometheus paths): merge `register_prometheus` + `setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings` method on the Rust `LLMEngine` trait. The trait surface shrinks from 7 → 6 methods; both expfmt bridging and the structured component publisher go through one place. The Python ABC keeps its two methods — the PyO3 bridge adapts (engine-author API unchanged for Python). #3 + #8 (framework lifecycle gauges gated on engine opt-in): - `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side `LifecycleGauges` (prometheus crate gauges registered via the runtime's `MetricsRegistry`). Worker constructs them after `engine.start()` succeeds; observes during cleanup/drain. Emits regardless of engine opt-in. - Drop `set_cleanup_time` / `set_drain_time` from the `ComponentMetricsPublisher` trait — engine no longer responsible. - Drop the corresponding methods from `PyComponentMetricsPublisher`. - `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for parity with the legacy entry points (both legacy `main.py` and the unified bridge populate it). The unified bridge now constructs `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when the engine returns no per-rank sources. #5 (kv_cache_hit_rate Optional semantics): clearer docstring on both Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly documented (`None` = no data, `0.0` = legitimate zero-hit). Deferred with rationale (explained in the post-review architect's note): - #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing but no current bug; 5 fields for 5 signals isn't worth the refactor cost yet. Revisit at ~10+ signals. - #4 (per-source cadence): explicit defer in original review. - #7 (conformance in CI): workflow change, separate from trait surface. Conformance kit collapses `check_register_prometheus` + `check_component_metrics` into `check_setup_metrics`. All 68 backend-common unit tests pass. Mocker example updated. PyO3 bridge wheel rebuilt successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>

…nly; extract executor helpers #5: PreparedTransferPlan becomes a struct (Transform fields only); the Direct variant + build_direct + annotated_layouts + the projection-memo cache it backed are all removed. lookup_prepared_plan returns Option<Arc<...>> with None for same-layout direct pairs; plan_and_lower projects AnnotatedLayouts inline on that branch. The microsecond projection-memo against millisecond-scale DMA/RDMA was never measurable, and the Direct cache entry per axis_slice pattern was churn for no gain. #2 surgical: TransferManager::execute_transfer + ::execute_transfer_selection now share resolve_and_dissolve (handle resolve + options dissolve + builder field threading) and dispatch_resolved (executor call + metric wrap). Each public method shrinks from ~80 lines to ~10. execute_planner_cuda_transfer + execute_planner_nixl_transfer share a new planner_prelude helper covering heterogeneous-reject + benchmark-lookup + prepared-lookup + plan_and_lower — only the family validator + policy + post-outcome dispatch differ per entry. Net ~-220 lines. Verified: 743/743 lib tests pass; incompat-smoke.sh 6/6 PASS; cargo fmt + clippy clean on touched files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>

Decode-side audit (mapped against the prefill-side iteration 1-5 chain in 31fe452) found one race shared with prefill: the wrapper's cleanup_failed_request has five connector-spawned entry points that can race for the same request_id — lifecycle watcher's failure-sink route, commit_usaa1's local-kick / remote-pipeline / session-missing spawns, and the enqueue spawn's mark_failed. Without a CAS guard, two paths can both observe non-empty unfilled_ids and both call worker_hook.mark_failed_onboarding before either's release_request removes the cd_request_state entry, double-notifying vLLM for the same failed G1 window. Adds cleanup_claimed: AtomicBool on CdRequestState, symmetric with the field on CdRequest, and CAS-gates the wrapper's cleanup_failed_request. commit_usaa1's USAA-1 state rebuild threads the existing flag forward so an already-claimed cleanup keeps its winner across the entry rebuild. Reproducer-first regression test (cd_decode_concurrent_cleanup_marks_failed_once) spawns 8 concurrent on_session_failure calls; verified to fail pre-fix ('got 2' calls) and pass post-fix ('got 1') on a 4-worker multi-thread runtime. Smoke confirms the audit's secondary finding (cooperative cancel routing through cleanup_failed_request) is quiescent: R1+R2 both clean, zero cleanup_failed_request / decode_failure_sink_invoked events for the successful request IDs at /tmp/kvbm-experiments/20260526-011649-two-request. Decode-side audit findings iteration #1: covered. Iterations #2-5 do not apply (different state geometry, no observer-drain shape, no buffer-then-attach pattern). See lib/kvbm-connector/CONTRACT.md "Concurrency" section for the full map.

…SM parity Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when ``last_call_at == -math.inf`` (the "never called" sentinel), so a plugin with ``execution_interval_seconds`` > 0 still fired on the first pipeline tick — bypassing the interval entirely on the first call. This breaks PSM cadence parity once PR #2 introduces builtin plugins like ``BuiltinThroughputPropose`` (``execution_interval_seconds= throughput_adjustment_interval_seconds=180s``). PSM's ``initial_tick(start_s)`` schedules the first throughput fire at ``start_s + 180s``, not at ``start_s``. The buggy first-fire would: 1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick), bumping ``last_call_at`` to 5. 2. At T=180 (PSM's expected first throughput fire) the plugin would not be due: ``180 - 5 = 175 < 180`` → skip. 3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's 180/360/540 cadence. Six concrete symptoms cascade from (3): wasted RPCs on load-only ticks, HOLD_LAST cache that never refreshes on the correct cadence, observability metrics that record "fired" for self-gating no-ops, and divergence from PSM's ``run_throughput_scaling`` flag once the flag is replaced by per-plugin throttle in PR #2. Fix: anchor first-ever fire on ``registered_at`` instead of unconditionally firing. ``_is_due`` becomes: if interval <= 0.0: return True # "every tick" anchor = registered_at if last_call_at == -inf else last_call_at return (now - anchor) >= interval Semantics: "fire every N seconds, starting N seconds after registration." Matches PSM's ``initial_tick`` semantic and aligns with most operators' intuition that ``interval=N`` means "every N seconds" (not "every N seconds plus one free immediate call"). Tests updated: - ``test_active_set::test_first_tick_triggers_even_with_positive_interval`` asserted the *old* (buggy) behaviour and has been replaced by ``test_first_fire_anchored_on_registration_time`` which asserts the new PSM-parity semantic with three time steps (registration, half-window, full-window). - Five other tests that called ``compute_active_set`` at t=0 to get a triggered plugin now ``clock.advance(interval)`` first. Each is annotated so the next reader knows why. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

…g launch summary Two changes to examples/backends/sglang/launch/agg_multimodal_router.sh: 1. Replace the custom echo banner block with print_launch_banner per .ai/bash-launch-guidelines.md (matches sibling agg_router.sh:60). --no-curl is set because our own wait_ready loop later handles the smoke test, and --multimodal flags the script's nature. 2. Build a KV_EVENTS_PORTS array alongside WORKER_PORTS and reference both in the summary section, so when the harness sets DYN_SYSTEM_PORT{i} the printed URLs match what was actually launched (instead of always showing the default formula). Previously CI logs lied about ports under dynamic test-port allocation. Addresses #9561 review: Devin comments #1 + #2, CodeRabbit nitpick on agg_multimodal_router.sh:65-68. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…SM parity Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when ``last_call_at == -math.inf`` (the "never called" sentinel), so a plugin with ``execution_interval_seconds`` > 0 still fired on the first pipeline tick — bypassing the interval entirely on the first call. This breaks PSM cadence parity once PR #2 introduces builtin plugins like ``BuiltinThroughputPropose`` (``execution_interval_seconds= throughput_adjustment_interval_seconds=180s``). PSM's ``initial_tick(start_s)`` schedules the first throughput fire at ``start_s + 180s``, not at ``start_s``. The buggy first-fire would: 1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick), bumping ``last_call_at`` to 5. 2. At T=180 (PSM's expected first throughput fire) the plugin would not be due: ``180 - 5 = 175 < 180`` → skip. 3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's 180/360/540 cadence. Six concrete symptoms cascade from (3): wasted RPCs on load-only ticks, HOLD_LAST cache that never refreshes on the correct cadence, observability metrics that record "fired" for self-gating no-ops, and divergence from PSM's ``run_throughput_scaling`` flag once the flag is replaced by per-plugin throttle in PR #2. Fix: anchor first-ever fire on ``registered_at`` instead of unconditionally firing. ``_is_due`` becomes: if interval <= 0.0: return True # "every tick" anchor = registered_at if last_call_at == -inf else last_call_at return (now - anchor) >= interval Semantics: "fire every N seconds, starting N seconds after registration." Matches PSM's ``initial_tick`` semantic and aligns with most operators' intuition that ``interval=N`` means "every N seconds" (not "every N seconds plus one free immediate call"). Tests updated: - ``test_active_set::test_first_tick_triggers_even_with_positive_interval`` asserted the *old* (buggy) behaviour and has been replaced by ``test_first_fire_anchored_on_registration_time`` which asserts the new PSM-parity semantic with three time steps (registration, half-window, full-window). - Five other tests that called ``compute_active_set`` at t=0 to get a triggered plugin now ``clock.advance(interval)`` first. Each is annotated so the next reader knows why. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

…nfig paths Independent review caught that ``RegisterRequest.requires_produced_fields`` and ``observation_window_seconds`` (the scale_interval cadence contract fields added in commits 13-18) were only reachable via the gRPC gateway self-register path. The two ConfigMap-driven intake paths — static external plugin list and in-process plugin spec — both used Pydantic ``extra="forbid"`` and silently dropped these fields. Net result: - ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]`` could not be configured via ConfigMap, only via gRPC self-register. - ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus aggregation window for stable averaging) was similarly unreachable. Five wiring fixes (top-down following the call chain): 1. ``LocalPlannerOrchestrator.register_internal``: added the two new kwargs, passed through to ``PluginRegistryServer.register_internal``. Until now the orchestrator-level facade silently dropped them. 2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic ``Field`` declarations for both new fields, with descriptions pointing at the cadence contract. 3. ``register_external_from_config`` (orchestrator.py): threaded both fields into the ``RegisterRequest`` construction. Without this, #2 alone wouldn't reach the registry. 4. ``InProcessPluginSpec`` (registry/config.py): added ``needs`` (also previously missing!) + the two new cadence fields. 5. ``load_in_process_plugins`` (in_process_loader.py): passed ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds`` into ``orchestrator.register_internal``. Tests: - ``test_loader_passes_scale_interval_fields_to_registered_plugin``: in-process loader passthrough end-to-end (spec → plugin). - ``test_bootstrap_passes_scale_interval_fields_through``: static external bootstrap passthrough end-to-end. 348 plugin tests pass (+2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines unformatted — `longest_prefix_match`'s collapsed signature and the `merged` binding in l1.rs, and the per-worker gauge assignment + a test assert in metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged them, failing the job. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…SM parity Pre-fix ``PluginScheduler._is_due`` returned True unconditionally when ``last_call_at == -math.inf`` (the "never called" sentinel), so a plugin with ``execution_interval_seconds`` > 0 still fired on the first pipeline tick — bypassing the interval entirely on the first call. This breaks PSM cadence parity once PR #2 introduces builtin plugins like ``BuiltinThroughputPropose`` (``execution_interval_seconds= throughput_adjustment_interval_seconds=180s``). PSM's ``initial_tick(start_s)`` schedules the first throughput fire at ``start_s + 180s``, not at ``start_s``. The buggy first-fire would: 1. Fire ``BuiltinThroughputPropose`` at T=5 (first pipeline tick), bumping ``last_call_at`` to 5. 2. At T=180 (PSM's expected first throughput fire) the plugin would not be due: ``180 - 5 = 175 < 180`` → skip. 3. Plugin re-fires at T=185 — permanently 5s ahead of PSM's 180/360/540 cadence. Six concrete symptoms cascade from (3): wasted RPCs on load-only ticks, HOLD_LAST cache that never refreshes on the correct cadence, observability metrics that record "fired" for self-gating no-ops, and divergence from PSM's ``run_throughput_scaling`` flag once the flag is replaced by per-plugin throttle in PR #2. Fix: anchor first-ever fire on ``registered_at`` instead of unconditionally firing. ``_is_due`` becomes: if interval <= 0.0: return True # "every tick" anchor = registered_at if last_call_at == -inf else last_call_at return (now - anchor) >= interval Semantics: "fire every N seconds, starting N seconds after registration." Matches PSM's ``initial_tick`` semantic and aligns with most operators' intuition that ``interval=N`` means "every N seconds" (not "every N seconds plus one free immediate call"). Tests updated: - ``test_active_set::test_first_tick_triggers_even_with_positive_interval`` asserted the *old* (buggy) behaviour and has been replaced by ``test_first_fire_anchored_on_registration_time`` which asserts the new PSM-parity semantic with three time steps (registration, half-window, full-window). - Five other tests that called ``compute_active_set`` at t=0 to get a triggered plugin now ``clock.advance(interval)`` first. Each is annotated so the next reader knows why. Full planner suite: 830 passed, 1 skipped, 0 failed. Signed-off-by: Kang Zhang <kangz@nvidia.com>

…nfig paths Independent review caught that ``RegisterRequest.requires_produced_fields`` and ``observation_window_seconds`` (the scale_interval cadence contract fields added in commits 13-18) were only reachable via the gRPC gateway self-register path. The two ConfigMap-driven intake paths — static external plugin list and in-process plugin spec — both used Pydantic ``extra="forbid"`` and silently dropped these fields. Net result: - ``throughput_propose`` declaring ``requires_produced_fields=["predictions"]`` could not be configured via ConfigMap, only via gRPC self-register. - ``observation_window_seconds=180`` (e.g. wanting a longer Prometheus aggregation window for stable averaging) was similarly unreachable. Five wiring fixes (top-down following the call chain): 1. ``LocalPlannerOrchestrator.register_internal``: added the two new kwargs, passed through to ``PluginRegistryServer.register_internal``. Until now the orchestrator-level facade silently dropped them. 2. ``ExternalPluginEntry`` (planner_config.py): added Pydantic ``Field`` declarations for both new fields, with descriptions pointing at the cadence contract. 3. ``register_external_from_config`` (orchestrator.py): threaded both fields into the ``RegisterRequest`` construction. Without this, #2 alone wouldn't reach the registry. 4. ``InProcessPluginSpec`` (registry/config.py): added ``needs`` (also previously missing!) + the two new cadence fields. 5. ``load_in_process_plugins`` (in_process_loader.py): passed ``needs`` + ``requires_produced_fields`` + ``observation_window_seconds`` into ``orchestrator.register_internal``. Tests: - ``test_loader_passes_scale_interval_fields_to_registered_plugin``: in-process loader passthrough end-to-end (spec → plugin). - ``test_bootstrap_passes_scale_interval_fields_through``: static external bootstrap passthrough end-to-end. 348 plugin tests pass (+2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines unformatted — `longest_prefix_match`'s collapsed signature and the `merged` binding in l1.rs, and the per-worker gauge assignment + a test assert in metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged them, failing the job. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jthomson04 <jwillthomson19@gmail.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together: 1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag. 2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location. 3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger' — but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage. This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000). Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6. Signed-off-by: Kai Ma <kaim@nvidia.com>

Address graham-code-review feedback on PR #10351: - Drop the secondary `owners` map; store `Owner = (instance_id, lora_slug)` inline with each entry value. One lock, one source of truth, no nested-write-lock hazard, no two-map sync risk. - `register` takes `&Owner` (one clone inside, not per-file). - Panic on collision: re-registering the same (slug, suffix, filename) with a different owner is a programming error (two attaches of the same model+suffix in one process would let detach-#1 wipe files detach-#2 still needs). Same-owner re-register is fine and just updates the path. - Doc + local var naming aligned on `instance_id` to match `local_model.rs`'s existing usage (the value populates `DiscoveryInstance::Model.instance_id`). - Tests: collision panic + same-owner update path coverage. Signed-off-by: nnshah1 <neelays@nvidia.com>

Resolve conflicts in components/power_agent/power_agent.py: - _resolve_cap_for_gpu: keep strict multi-pod policy (PR9790 Codex finding #2) over base's lenient None-filtering docstring/intent; the merged code body enforces the strict policy. - _reconcile_gpu: keep actuator-routed PID read + apply_cap (the DCGM dual-actuator path that is the purpose of this PR) over base's inline pynvml; folded base's persistent-cap rationale into the docstring. release.yml and power-agent/README.md auto-merged. Validated: components/power_agent/tests 206 passed. Signed-off-by: Kai Ma <kaim@nvidia.com>

…dya review #2) Replace the monolithic recipes.yaml / benchmarks.yaml with one file per entry + an ordering index + a JSON schema + a validator, to cut merge conflicts and give a cleaner per-entry ownership/validation path. docs/recipes/_catalog/ index.yaml # ordering + inclusion (10 active + 3 deferred) schema.json # JSON Schema (draft 2020-12), one recipe object recipes/<id>.yaml # one object per file, verbatim validate.py # validates BOTH catalogs docs/benchmarks/_catalog/{index.yaml,schema.json,benchmarks/<id>.yaml} validate.py checks index<->file correspondence, id/filename match, no dups, schema conformance, page/asset path resolution, and cross-catalog referential integrity; exits non-zero on failure. Stdlib-only with graceful degradation (prefers pyyaml+jsonschema, falls back to a builtin parser + required-keys check). All entries preserved verbatim (round-trip diff: 0 mismatches). Wiring the validator into CI is a tracked follow-up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

update token

919cd01

nv-anants had a problem deploying to GITLAB March 4, 2025 00:48 — with GitHub Actions Failure

test

1b936c6

nv-anants temporarily deployed to GITLAB March 4, 2025 01:09 — with GitHub Actions Inactive

nv-anants had a problem deploying to GITLAB March 4, 2025 01:09 — with GitHub Actions Failure

nv-anants had a problem deploying to GITLAB March 4, 2025 01:14 — with GitHub Actions Failure

nv-anants had a problem deploying to GITLAB March 4, 2025 01:27 — with GitHub Actions Failure

nv-anants closed this Mar 4, 2025

nv-anants had a problem deploying to GITLAB March 4, 2025 14:32 — with GitHub Actions Failure

richardhuo-nv mentioned this pull request Jun 10, 2025

docs: add message to guide users to the stable version #1457

Merged

This was referenced Aug 1, 2025

docs: Dyn 591 #2247

Merged

[WIP] e2e example of using dynamo runtime to build inference pipelines #2259

Closed

elyasmnvidian added a commit that referenced this pull request Aug 13, 2025

chore: formatting #2

0395195

elyasmnvidian added a commit that referenced this pull request Aug 13, 2025

chore: formatting #2

d0aaa66

elyasmnvidian added a commit that referenced this pull request Aug 13, 2025

chore: formatting #2

05305a7

athreesh added a commit that referenced this pull request Aug 18, 2025

slight updates #2

bd137a7

elyasmnvidian added a commit that referenced this pull request Sep 23, 2025

chore: fix await unit test for harmony #2

f4ca63e

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian added a commit that referenced this pull request Sep 26, 2025

chore: clippy fmt #2

16e1a82

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian added a commit that referenced this pull request Sep 26, 2025

chore: clippy fmt #2

3e25561

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian added a commit that referenced this pull request Sep 26, 2025

chore: clippy fmt #2

f4de215

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian added a commit that referenced this pull request Sep 26, 2025

chore: cargo lock #2

c3c368e

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian added a commit that referenced this pull request Sep 30, 2025

chore: clippy #2

7c20bef

Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>

elyasmnvidian mentioned this pull request Oct 15, 2025

fix: deepseek tool parsing fixed #3557

Merged

ryanolson added a commit that referenced this pull request Feb 16, 2026

test(velo-events): harden slot_v2 waiter/wakeup regression coverage (#2)

e08c7ec

kthui mentioned this pull request Mar 20, 2026

fix: Sglang Cancellation Aggregated Test. #3896

Open

nnshah1 mentioned this pull request Apr 3, 2026

feat: request lifecycle logging + metrics capture [DIS-1643] #7840

Merged

4 tasks

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

kangclzjc mentioned this pull request May 15, 2026

feat(planner): plugin framework infrastructure (PR #1 of 2) #9607

Closed

nealvaidya mentioned this pull request May 19, 2026

fix(mocker): gate vLLM waiting-request admission on full ISL fit (#9718) #9721

Closed

8 tasks

dmitry-tokarev-nv mentioned this pull request May 26, 2026

fix(trtllm): restore deps, default CMD, and reduce layer count after upstream base switch #9889

Merged

15 tasks

kangclzjc mentioned this pull request May 29, 2026

feat(planner): plugin framework infrastructure (PR #1 of 2) #10124

Merged

dmitry-tokarev-nv mentioned this pull request May 29, 2026

feat(tests): XPU router e2e test support with TOCTOU port fixes #8675

Merged

jthomson04 mentioned this pull request Jun 3, 2026

perf(frontend): profile-guided hot-path optimizations (tokenizer cache, metric handles, log-trace cap) #10273

Merged

BenHamm mentioned this pull request Jun 14, 2026

docs: add Recipes & Feature Benchmarks Fern surfaces #10628

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: update trigger token#2

ci: update trigger token#2
nv-anants wants to merge 2 commits into
mainfrom
anants/update-ci

nv-anants commented Mar 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nv-anants commented Mar 4, 2025

What does the PR do?

Checklist

Commit Type:

Related PRs:

Where should the reviewer start?

Test plan:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant