Skip to content

ci: remove timeout for cargo deny check#4

Closed
nv-anants wants to merge 3 commits into
mainfrom
anants/fix-rust-ci
Closed

ci: remove timeout for cargo deny check#4
nv-anants wants to merge 3 commits into
mainfrom
anants/fix-rust-ci

Conversation

@nv-anants

Copy link
Copy Markdown
Member

What does the PR do?

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@github-actions

github-actions Bot commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

Test Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit d750624.

♻️ This comment has been updated with latest results.

@saturley-hall

Copy link
Copy Markdown
Member

Merged into PR #3

kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
elyasmnvidian added a commit that referenced this pull request Oct 14, 2025
elyasmnvidian added a commit that referenced this pull request Oct 15, 2025
ryanolson added a commit that referenced this pull request Dec 2, 2025
Signed-off-by: Ryan Olson <rolson@nvidia.com>
soodoshll added a commit to soodoshll/dynamo that referenced this pull request Feb 9, 2026
… (text mode) (ai-dynamo#4)

* fix

* ups

Signed-off-by: Qidong Su <soodoshll@gmail.com>

* upd

* update

* fix

* upd

* fix

---------

Signed-off-by: Qidong Su <soodoshll@gmail.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 4, 2026
…code-waiter cleanup

Addresses two correctness issues raised by dynamo-ops review on PR 8965.

1. **Race on PrefillReady re-insert** (model_manager.rs)

   Both `register_prefill_router` and `activate_prefill_router` previously
   used a `remove → process → insert` pattern on
   `prefill_router_activators`. Between the `remove` and the `insert` the
   entry is gone from the map, so a concurrent `remove_prefill_activator`
   call (called by the watcher on prefill-component teardown) hits the
   empty map, skips its cleanup, and we then re-insert a stale
   `PrefillReady` for a prefill that's already gone. The next decode
   rebuild's lookup finds the resurrected cache and activates a
   `PrefillRouter` against a dead endpoint.

   Refactor both functions to use DashMap's `Entry` API. The shard lock is
   held for the duration of the OccupiedEntry, so any concurrent
   `remove_prefill_activator` serializes after us and observes the entry
   it needs to clear. The PrefillReady-consume path in
   `register_prefill_router` now reads-and-clones in place without
   removing; the DecodeWaiting/PrefillReady → PrefillReady transition in
   `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an
   atomic value swap that returns the old value.

2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs)

   `handle_delete_helper`'s decode branch only called
   `remove_decode_prefill_waiter` when `remove_worker_set` returned
   `Some(_)`. If decode registered (creating a `DecodeWaiting` activator
   entry via `register_prefill_router`) but `handle_add_helper` failed
   later (e.g., on `kv_chooser_for`, monitor setup, or
   `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever
   landed in the manager — and on later teardown
   `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting`
   entry orphaned in the activator map.

   Drop the `removed.is_some()` guard. The helper itself is state-safe
   (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so
   calling it on a key that's vacant or holds `PrefillReady` is a no-op.
   The prefill-teardown branch keeps its `removed.is_some()` guard
   because `remove_prefill_activator` blanket-removes both states, and
   calling it without a real WorkerSet teardown could orphan a live
   decode-side `DecodeWaiting`.

Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary
OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting,
ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joshuayao pushed a commit to joshuayao/dynamo that referenced this pull request May 13, 2026
…xl v1.0.1

- sglang_ref: e5c2a9b -> v0.5.11
- SGLANG_KERNEL_REF: c668bb67 -> 3fbd4d3372 (latest sgl-kernel-xpu main)
- nixl_ref: 0.10.1 -> v1.0.1

Manually re-applied from 7990.patch commit ai-dynamo#4 (skipped during git am
due to context drift in container/context.yaml).

Signed-off-by: yuanwu <yuan.wu@intel.com>
nv-rinig added a commit that referenced this pull request May 14, 2026
That reference was tied to the CI fan-out section, which is no longer in
this PR.

Signed-off-by: Rini Gupta <rinig@nvidia.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request Jun 3, 2026
Steady-state RSS of the agent (pynvml + kubernetes client + prometheus
exporter) is ~80 MiB on an idle 8-GPU node — the 128Mi default left
only ~48 MiB of headroom, breaking under both routine inspection
(`kubectl exec --container=power-agent -- python -c "..."` re-imports
the kubernetes client in the same cgroup; observed RSS spike > 128 MiB,
two OOM-kills during 2026-05-21 live testing on AKS dpp-dev-env) and
transient peaks in multi-pod conflict resolution where the UID →
annotation dict + PID → UUID map are both held in memory simultaneously
(linear in pod count per node).

Bump:
  limits.memory:   128Mi → 256Mi   (~3× steady-state RSS headroom)
  requests.memory:  64Mi →  96Mi   (more honest scheduling floor)

CPU envelope unchanged — agent reconciles every 15 s and the heavy
calls (pynvml + list_pod_for_all_namespaces) are well under 200m on
fleet inspection (median: 12 m, p99: 78 m).

Fleet cost: at 1000 GPU nodes the extra 128 MiB per node is 125 GiB,
which is negligible compared to the operational cost of a silently
OOM-killed power-cap controller (the failure mode is a node whose
caps slowly drift back to safeDefaultWatts on the next reconcile that
doesn't fit the budget, with no obvious alert signal — the agent's
Pod restart counter is the only indicator).

Verified: helm template + helm lint clean.

Refs: PR #9682 review, Power Agent live-test finding #4.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
…oder reuse, doc fixes

Low-risk cleanup batch from the independent review (no decision-path change):

- ai-dynamo#4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS``
  so it participates in first-writer-wins partial merge like the other three
  predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain,
  contradicting the proto/Pydantic contract). +2 chain_augment tests.
- #10 engine_adapter: add ``scale_down_capped_by_throughput`` to
  ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed
  between scale_up and scale_down to mirror PSM's _PRIORITY).
- ai-dynamo#11 dead code: drop ``contributing_plugin_ids`` (built, never read) in
  pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids``
  (no caller in PR #1; would KeyError if reached).
- ai-dynamo#18 _encode_fpm: use the canonical
  ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder)
  instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and
  re-implementing the encoding. Byte-identical wire format; keeps FPM
  serialization in lock-step with the rest of dynamo.
- ai-dynamo#17 transport ABC docstring: timeout is enforced by the transport
  (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the
  pipeline uses a bare gather to avoid double-counting the deadline.
- ai-dynamo#20 scheduler docstring: note the heartbeat-eviction monitor is not wired
  in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up).
- ai-dynamo#21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture
  was removed with component_name; comments were stale).
- ai-dynamo#22 metrics test: remove the dead no-op ``pass`` loop in _sample_value.

828 planner tests pass (was 825; +3 chain-augment / merge tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nv-kmcgill53 pushed a commit that referenced this pull request Jun 11, 2026
…ADP routing

In disaggregated serving, TRT-LLM's ConversationAwareADPRouter pins each
conversation to an attention-DP rank (KV locality) by reading
disaggregated_params.conversation_id. Native trtllm-serve populates this from the
client's conversation routing; the Dynamo TRT-LLM workers did not (gap #4), so the
engine's conversation-affinity router never engaged and CTX prefill load scattered
across ranks -> prefill queue -> high TTFT tail.

This plumbs conversation_id end-to-end so the engine router engages:
- RoutingHints gains conversation_id, seeded by the OpenAI preprocessor from
  nvext.session_control.session_id (lib/llm).
- Both the unified worker (llm_engine.py) and the legacy handler
  (request_handlers/handler_base.py) set disaggregated_params.conversation_id on
  the engine request, on prefill and decode.
- Gated by DYN_ENGINE_CONV_AFFINITY=1: when set, the worker stops force-feeding
  attention_dp_rank (an explicit rank is honored before conversation affinity in
  the engine and would bypass ConversationAwareADPRouter).
- dp_rank.conversation_id_from_request helper + unit test; one-shot INFO log to
  confirm the runtime invariants without enabling debug.

Validated on Lyris GB300 (DSV4-Pro, 3xCTX-DEP4 + 1xGEN-DEP8, c460, perf-off) with
engine config kv_cache_routing_conversation_affinity=true: TTFT p95 8.39 -> 4.84s
(~42%), matching native trtllm-serve (4.81); p99 7.12 vs 7.00; 46 -> 50 req/s.
Runtime-confirmed: conversation_id set + attention_dp_rank not forced, on both
PREFILL and DECODE.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants