
Update README.md #1

Merged
nnshah1 merged 2 commits into main from Update-README.md
Mar 4, 2025
Conversation

@nvda-mesharma

Contributor

No description provided.

@nnshah1 nnshah1 merged commit 5033457 into main Mar 4, 2025
@nnshah1 nnshah1 deleted the Update-README.md branch March 4, 2025 17:16
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
copy-pr-bot Bot pushed a commit that referenced this pull request Sep 9, 2025
* add metrics to disconnect

* fmt

* fmt

Signed-off-by: michaelfeil <me@michaelfeil.eu>
grahamking pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: ayushag <ayushag@nvidia.com>
zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 22, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request May 29, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request May 31, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 2, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.
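
The clock seam described above can be sketched as follows. This is a minimal illustration, not the actual planner code: `AdapterSketch` is a hypothetical stand-in for ``OrchestratorEngineAdapter``, `tick()` takes `now_s` directly instead of a `tick_input` object, and the `VirtualClock` constructor signature is assumed.

```python
import time
from typing import Optional, Protocol


class Clock(Protocol):
    def monotonic(self) -> float: ...


class WallClock:
    """Reads real elapsed time (the production default)."""

    def monotonic(self) -> float:
        return time.monotonic()


class VirtualClock:
    """Manually advanced clock, so replay can drive time from the trace."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def monotonic(self) -> float:
        return self._now

    def advance(self, delta: float) -> None:
        self._now += delta


class AdapterSketch:
    """Hypothetical stand-in for OrchestratorEngineAdapter."""

    def __init__(self, clock: Optional[Clock] = None) -> None:
        # Defaulting to WallClock keeps production behaviour unchanged.
        self._clock: Clock = clock if clock is not None else WallClock()

    def tick(self, now_s: float) -> None:
        # Only a VirtualClock is bumped to trace time; WallClock is left alone.
        if isinstance(self._clock, VirtualClock):
            delta = now_s - self._clock.monotonic()
            if delta > 0:  # backwards trace time is a silent no-op
                self._clock.advance(delta)


clock = VirtualClock()
adapter = AdapterSketch(clock=clock)
adapter.tick(180.0)
adapter.tick(360.0)
adapter.tick(100.0)  # earlier than the current clock: ignored
print(clock.monotonic())  # 360.0
```

With the clock injected, anything that reads `self._clock.monotonic()` sees trace time during replay, so interval checks fire at trace pace rather than wall-clock pace.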

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 2, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
krishung5 added a commit that referenced this pull request Jun 2, 2026
…g launch summary

Two changes to examples/backends/sglang/launch/agg_multimodal_router.sh:

1. Replace the custom echo banner block with print_launch_banner per
   .ai/bash-launch-guidelines.md (matches sibling agg_router.sh:60).
   --no-curl is set because our own wait_ready loop later handles the
   smoke test, and --multimodal flags the script's nature.

2. Build a KV_EVENTS_PORTS array alongside WORKER_PORTS and reference
   both in the summary section, so when the harness sets DYN_SYSTEM_PORT{i}
   the printed URLs match what was actually launched (instead of always
   showing the default formula). Previously CI logs lied about ports
   under dynamic test-port allocation.

Addresses #9561 review: Devin comments #1 + #2, CodeRabbit nitpick on
agg_multimodal_router.sh:65-68.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krishung5 added a commit that referenced this pull request Jun 2, 2026
- decode_handler._extract_mm_hashes docstring now states 16-char hex
  for the SGLang path (was incorrectly claiming the vLLM 64-char shape;
  Devin #5).
- sglang launch banner drops the "Lightseek" prefix; uses "MM Exact
  Routing (SGLang)" to match the public name (Ryan suggestion #1).
- ModelRuntimeConfig.backend_framework field gains a doc-block on its
  motivation — frontend uses it for backend-specific routing hints
  (Ryan #3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 3, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 3, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
Squashed from 8 development commits. See PR description for full
context. Infrastructure-only — builtin plugins + dual-path parity
tests land in the follow-up PR.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
``OrchestratorEngineAdapter.__init__`` hard-coded ``WallClock()``, which
left no seam for replay paths to propagate trace time into the plugin
layer.  Plugin scheduler ``_is_due``, CircuitBreaker cooldown, and
HOLD_LAST cache age all read ``self._clock.monotonic()`` — under a
fast-forward replay (e.g. 1hr trace in <10s real time), wall-clock
barely moves and any plugin with ``execution_interval_seconds`` larger
than the real-time duration never re-fires after its first call.

This is invisible in PR #1's current ship surface (PR #1 has no
builtin plugins, K8s smoke runs in real time, and PSM-only replay
goes through ``_PSMEngineAdapter`` instead) but would block PR #10
(``use_orchestrator=True`` default) — once orchestrator becomes the
default path, mooncake replay must work.

Fix:

- ``OrchestratorEngineAdapter.__init__`` accepts an optional
  ``clock: Clock`` kwarg, defaulting to ``WallClock`` so production
  behaviour is unchanged.
- ``engine_adapter.tick()`` bumps the clock to ``tick_input.now_s``
  at the start of every tick when a ``VirtualClock`` is in play
  (``advance(delta)`` only if ``delta > 0`` — backwards trace time
  is a silent no-op rather than a crash).
- ``ReplayPlannerEngine`` constructs a ``VirtualClock`` and passes it
  to the adapter on the orchestrator path so plugin scheduler sees
  trace time.

Regression tests in ``test_engine_adapter.py``:

- ``test_tick_advances_injected_virtual_clock_to_trace_time``:
  drive two ticks at trace time 180s and 360s, assert clock follows.
- ``test_tick_does_not_advance_clock_backwards``: pre-advance the
  clock past tick_input.now_s, assert no exception and clock stays
  put.
- ``test_default_clock_is_wallclock``: lock production default so a
  future refactor that flips it doesn't silently break K8s.

Full planner suite: 830 passed, 1 skipped, 0 failed.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 4, 2026
…oder reuse, doc fixes

Low-risk cleanup batch from the independent review (no decision-path change):

- #4 chain_augment: add ``predicted_kv_hit_rate`` to ``_PREDICTION_FIELDS``
  so it participates in first-writer-wins partial merge like the other three
  predicted_* fields (was silently dropped in any 2+ plugin PREDICT chain,
  contradicting the proto/Pydantic contract). +2 chain_augment tests.
- #10 engine_adapter: add ``scale_down_capped_by_throughput`` to
  ``_aggregate_disagg_load_reason`` priority (PSM disagg emits it; placed
  between scale_up and scale_down to mirror PSM's _PRIORITY).
- #11 dead code: drop ``contributing_plugin_ids`` (built, never read) in
  pipeline._run_fanout_stage; drop ``_set_enabled`` + ``_plugin_ids``
  (no caller in PR #1; would KeyError if reached).
- #18 _encode_fpm: use the canonical
  ``dynamo.common.forward_pass_metrics.encode`` (shared module-level encoder)
  instead of allocating a fresh ``msgspec.msgpack.Encoder`` per tick and
  re-implementing the encoding. Byte-identical wire format; keeps FPM
  serialization in lock-step with the rest of dynamo.
- #17 transport ABC docstring: timeout is enforced by the transport
  (``call()`` wraps ``asyncio.wait_for``), not the orchestrator — the
  pipeline uses a bare gather to avoid double-counting the deadline.
- #20 scheduler docstring: note the heartbeat-eviction monitor is not wired
  in this PR (last_heartbeat_at is recorded but unread; monitor is follow-up).
- #21 transport contract test: 7 inputs (not 8) → 14 cases (multi_pool fixture
  was removed with component_name; comments were stale).
- #22 metrics test: remove the dead no-op ``pass`` loop in _sample_value.
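
For item #4, a first-writer-wins partial merge over a fixed field list might look like the sketch below. Only ``predicted_kv_hit_rate`` is named in the commit message; the other three field names are placeholders, and `merge_first_writer_wins` is a hypothetical helper rather than the real `chain_augment` function.

```python
from typing import Dict, List, Optional

# Only predicted_kv_hit_rate comes from the commit message; the other
# three names are placeholders for the unnamed predicted_* fields.
_PREDICTION_FIELDS = (
    "predicted_a",
    "predicted_b",
    "predicted_c",
    "predicted_kv_hit_rate",
)

Prediction = Dict[str, Optional[float]]


def merge_first_writer_wins(outputs: List[Prediction]) -> Prediction:
    """Merge plugin outputs in chain order; the first non-None value per field wins."""
    merged: Prediction = {name: None for name in _PREDICTION_FIELDS}
    for output in outputs:
        for name in _PREDICTION_FIELDS:
            value = output.get(name)
            if merged[name] is None and value is not None:
                merged[name] = value
    return merged


plugin_1 = {"predicted_a": 10.0}
plugin_2 = {"predicted_a": 99.0, "predicted_kv_hit_rate": 0.4}
result = merge_first_writer_wins([plugin_1, plugin_2])
print(result["predicted_a"])            # 10.0 (first writer kept)
print(result["predicted_kv_hit_rate"])  # 0.4
```

A field missing from the tuple never enters the merge loop, which matches the "silently dropped" behaviour the fix addresses.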

828 planner tests pass (was 825; +3 chain-augment / merge tests).
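
The timeout ownership noted in item #17 can be illustrated with a small sketch, assuming a transport whose `call()` wraps `asyncio.wait_for` and a caller that gathers without adding its own deadline. `TransportSketch`, `_send`, and `run_fanout` are hypothetical names, and `return_exceptions=True` is used here only so the example can show both outcomes.

```python
import asyncio
from typing import List


class TransportSketch:
    """Hypothetical transport: the per-call deadline is enforced here."""

    def __init__(self, timeout_s: float) -> None:
        self._timeout_s = timeout_s

    async def _send(self, delay_s: float) -> str:
        # Stand-in for the real network round trip.
        await asyncio.sleep(delay_s)
        return "ok"

    async def call(self, delay_s: float) -> str:
        return await asyncio.wait_for(self._send(delay_s), timeout=self._timeout_s)


async def run_fanout(transport: TransportSketch, delays: List[float]) -> list:
    # No outer wait_for: the deadline is applied once, inside call().
    return await asyncio.gather(
        *(transport.call(d) for d in delays), return_exceptions=True
    )


results = asyncio.run(run_fanout(TransportSketch(timeout_s=0.05), [0.0, 0.5]))
print(results[0])                                    # ok
print(isinstance(results[1], asyncio.TimeoutError))  # True
```

Wrapping the gather in a second `wait_for` would apply a second, independent deadline to the same calls, which is the double-counting the docstring change warns about.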

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
kangclzjc added a commit that referenced this pull request Jun 4, 2026
Signed-off-by: Kang Zhang <kangz@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable
on a fresh checkout. Fixing any one in isolation leaves the build broken,
so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'
   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'
   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/
   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous
   COPY would silently copy zero files under the new pin. Switch the source
   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing
   'import logger', but logger.py lives in DCGM's source tree under
   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup
   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging
   adapter at components/power_agent/logger.py and COPY it into
   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a
fresh clone (verified locally via 'docker buildx build --build-arg
DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,
image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the
Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
tmonty12 pushed a commit that referenced this pull request Jun 8, 2026
Signed-off-by: Kang Zhang <kangz@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nnshah1 added a commit that referenced this pull request Jun 10, 2026
Address graham-code-review feedback on PR #10351:

- Drop the secondary `owners` map; store `Owner = (instance_id,
  lora_slug)` inline with each entry value. One lock, one source of
  truth, no nested-write-lock hazard, no two-map sync risk.
- `register` takes `&Owner` (one clone inside, not per-file).
- Panic on collision: re-registering the same (slug, suffix, filename)
  with a different owner is a programming error (two attaches of the
  same model+suffix in one process would let detach-#1 wipe files
  detach-#2 still needs). Same-owner re-register is fine and just
  updates the path.
- Doc + local var naming aligned on `instance_id` to match
  `local_model.rs`'s existing usage (the value populates
  `DiscoveryInstance::Model.instance_id`).
- Tests: collision panic + same-owner update path coverage.
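
The registration rule above is implemented in Rust; the Python sketch below only illustrates the logic under assumptions. `FileRegistrySketch` and the field types are hypothetical, and a raised `RuntimeError` stands in for the panic.

```python
import threading
from typing import Dict, NamedTuple, Optional, Tuple


class Owner(NamedTuple):
    instance_id: int          # type assumed for illustration
    lora_slug: Optional[str]  # type assumed for illustration


Key = Tuple[str, str, str]  # (slug, suffix, filename)


class FileRegistrySketch:
    """One lock, one map: the owner is stored inline with each entry."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[Key, Tuple[str, Owner]] = {}

    def register(self, key: Key, path: str, owner: Owner) -> None:
        with self._lock:
            existing = self._entries.get(key)
            if existing is not None and existing[1] != owner:
                # Stands in for the panic: a different owner is a programming error.
                raise RuntimeError(f"{key} already registered by {existing[1]}")
            # A new key, or the same owner re-registering, just updates the path.
            self._entries[key] = (path, owner)

    def path_for(self, key: Key) -> Optional[str]:
        with self._lock:
            entry = self._entries.get(key)
            return entry[0] if entry is not None else None


registry = FileRegistrySketch()
key = ("model", "suffix", "config.json")
first = Owner(instance_id=1, lora_slug=None)
registry.register(key, "/tmp/a/config.json", first)
registry.register(key, "/tmp/b/config.json", first)  # same owner: path updated
print(registry.path_for(key))  # /tmp/b/config.json
```

Keeping the owner in the entry value means a lookup and its ownership check happen under the same lock, so there is no second map to keep in sync.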

Signed-off-by: nnshah1 <neelays@nvidia.com>
Broduker pushed a commit to Broduker/dynamo that referenced this pull request Jun 12, 2026
…i-dynamo#10124)

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: shenls <shenlinshan@kanzhun.com>
3 participants