Skip to content

init: helm charts#5

Merged
whoisj merged 2 commits into
mainfrom
whoisj/init-helm-chart
Mar 4, 2025
Merged

init: helm charts#5
whoisj merged 2 commits into
mainfrom
whoisj/init-helm-chart

Conversation

@whoisj

@whoisj whoisj commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

Adds a basic Helm chart for a generic dynemo container.
Lack specific support for API server vs. worker, and OpenAI vs. VLLM vs. TRTLLM, etc.
Future iteration may include specific charts for those topics.
Should be nominally compatible with upcoming Compound AI tooling while retaining flexibility required by sophisticated customers.

Adds a basic test framework for validating Helm Chart using helm template w/ regular expressions.

Adds a set of 3 basic tests:

  • basic positive test
  • simple invalid values test
  • simple GPU resource specification test
Reviewer Instructions

Please focus on the following files:

  • deploy/Kubernetes/common/chart/Chart.yaml
  • deploy/Kubernetes/common/chart/values.yaml
  • deploy/Kubernetes/common/chart/templates/_helpers.tpl
  • deploy/Kubernetes/common/chart/templates/component-deployment.yaml

Other code is less important, but reviews are welcome
... and (mostly) ignore / accept the pwsh code for now.

DLIS-8146

@whoisj

whoisj commented Mar 4, 2025

Copy link
Copy Markdown
Contributor Author

@hutm, ideally you can offload the review effort to Julien Mancuso. I would add him, but I do not know his Github alias and the search feature isn't finding it for me. Thanks.

note: Julien has since been added to the repository and to the reviewers list for this PR.

@github-actions

github-actions Bot commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

Test Results

 2 files   2 suites   25s ⏱️
73 tests 73 ✅ 0 💤 0 ❌
91 runs  90 ✅ 1 💤 0 ❌

Results for commit 00b7104.

♻️ This comment has been updated with latest results.

@whoisj whoisj requested a review from julienmancuso March 4, 2025 18:09
Comment thread deploy/Kubernetes/common/chart/templates/_helpers.tpl Outdated

@ishandhanani ishandhanani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I would wait for Julien to give the final review before we get this in. Thanks for cooking on this J!

Comment thread deploy/Kubernetes/common/chart/values.yaml Outdated

@julienmancuso julienmancuso left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

whoisj added 2 commits March 4, 2025 16:11
Adds a basic Helm chart for a generic dynemo container.
Lack specific support for API server vs. worker, and OpenAI vs. VLLM vs. TRTLLM, etc.
Future iteration may include specific charts for those topics.
Should be nominally compatible with upcoming Compound AI tooling while retaining flexibility required by sophisticated customers.

Adds a basic test framework for validating Helm Chart using `helm template` w/ regular expressions.

Adds a set of 3 basic tests:
- basic positive test
- simple invalid values test
- simple GPU resource specification test
This chang excludes YAML files contained in a Helm chart's templates/ directory as they'll never pass a simple-minded YAML linter.
@whoisj whoisj force-pushed the whoisj/init-helm-chart branch from 7f42f6c to 00b7104 Compare March 4, 2025 21:11
@whoisj whoisj requested a review from biswapanda as a code owner March 4, 2025 21:11
@whoisj whoisj merged commit c909567 into main Mar 4, 2025
@whoisj whoisj deleted the whoisj/init-helm-chart branch March 4, 2025 21:34
ishandhanani added a commit that referenced this pull request Jan 31, 2026
Add agg_vision.sh for single-GPU aggregated multimodal serving with
Qwen2.5-VL. Fix compliance test #5 to include required "detail" field
on input_image content per the OpenResponses spec.

Tested: 5/6 compliance tests pass against Qwen2.5-VL-7B-Instruct
(tool calling fails as expected -- VL model lacks structured tool use).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MatejKosec pushed a commit that referenced this pull request Feb 6, 2026
Add agg_vision.sh for single-GPU aggregated multimodal serving with
Qwen2.5-VL. Fix compliance test #5 to include required "detail" field
on input_image content per the OpenResponses spec.

Tested: 5/6 compliance tests pass against Qwen2.5-VL-7B-Instruct
(tool calling fails as expected -- VL model lacks structured tool use).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MatejKosec pushed a commit that referenced this pull request Feb 10, 2026
Add agg_vision.sh for single-GPU aggregated multimodal serving with
Qwen2.5-VL. Fix compliance test #5 to include required "detail" field
on input_image content per the OpenResponses spec.

Tested: 5/6 compliance tests pass against Qwen2.5-VL-7B-Instruct
(tool calling fails as expected -- VL model lacks structured tool use).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MatejKosec pushed a commit that referenced this pull request Feb 12, 2026
Add agg_vision.sh for single-GPU aggregated multimodal serving with
Qwen2.5-VL. Fix compliance test #5 to include required "detail" field
on input_image content per the OpenResponses spec.

Tested: 5/6 compliance tests pass against Qwen2.5-VL-7B-Instruct
(tool calling fails as expected -- VL model lacks structured tool use).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
keivenchang added a commit that referenced this pull request Apr 30, 2026
CodeRabbit flagged the new tests as carrying internal Linear ticket
references in their doc-comments — repo policy forbids Linear refs in
source code. Replaced each occurrence with neutral wording:

- DIS-1842 work-item #1 → cross-parser tool_choice parametrisation
  work-item (tracked separately)
- DIS-1842 work-item #5 → cross-parser finish_reason mapping work-item
  (tracked separately)
- Universal gap noted in DIS-1842 → Universal gap noted in the test
  taxonomy

Linear refs remain in the PR description (allowed per repo convention).
424 parser tests + 16 tool_choice tests pass.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
keivenchang added a commit that referenced this pull request May 7, 2026
…ery path

The previous commit switched the PyO3 batch binding to
detect_and_parse_tool_call_with_recovery (Ayush #5), which legitimately
changed Dynamo's behavior on missing-end-token / malformed inputs:
the recovery path now extracts a tool call where the streaming-safe
path used to drop it.

* Updated 5 fixture expected values to match Dynamo's recovery output:
    glm47/PARSER.batch.5
    minimax_m2/PARSER.batch.5
    nemotron_deci/PARSER.batch.4
    nemotron_deci/PARSER.batch.5
    qwen3_coder/PARSER.batch.5
* Added vllm/sglang KNOWN_DIVERGENCES entries for the same cases since
  those parsers still drop where Dynamo now recovers (impl-defined
  recovery contract per PARSER_CASES.md). sglang/qwen3_coder/batch.5
  intentionally omitted — sglang's qwen3_coder detector recovers
  identically to Dynamo and matches the new expected.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
keivenchang added a commit that referenced this pull request May 7, 2026
…ery path

The previous commit switched the PyO3 batch binding to
detect_and_parse_tool_call_with_recovery (Ayush #5), which legitimately
changed Dynamo's behavior on missing-end-token / malformed inputs:
the recovery path now extracts a tool call where the streaming-safe
path used to drop it.

* Updated 5 fixture expected values to match Dynamo's recovery output:
    glm47/PARSER.batch.5
    minimax_m2/PARSER.batch.5
    nemotron_deci/PARSER.batch.4
    nemotron_deci/PARSER.batch.5
    qwen3_coder/PARSER.batch.5
* Added vllm/sglang KNOWN_DIVERGENCES entries for the same cases since
  those parsers still drop where Dynamo now recovers (impl-defined
  recovery contract per PARSER_CASES.md). sglang/qwen3_coder/batch.5
  intentionally omitted — sglang's qwen3_coder detector recovers
  identically to Dynamo and matches the new expected.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
kaim-eng added a commit that referenced this pull request May 11, 2026
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
  (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
    `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
    `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
    usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
    `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via
    `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
    rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
    `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
    one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
  not committed" convention (matches `examples/deployments/powerplanner/
  .tmp-gp-minimal.yaml`) and to block the four common root-level
  personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
  `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
  `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
  `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
  `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
  cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
  `kubectl wait --for=jsonpath='{.status.state}'=successful` form
  in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
    590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
    draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode`
    are pre-existing PlannerConfig fields (owned by
    `monitoring/aic_interpolation.py`) and intentionally absent from
    the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
    it correctly points at the *throughput regression* case rather
    than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with
    `pynvml.nvmlDeviceGetCount()` (the original was syntactically
    incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated
    `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
    `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
    §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the
    intentional D21/E21 ID collision (IDs unique by filename + tuple,
    not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven
    items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
    + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
  Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
  test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
  Windows-only environment limitations (cp1252 codec, `os.killpg`
  POSIX-only, missing `filterpy`) confirmed via `git stash` to be
  untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
  `tests/fault_tolerance/...` (pre-existing, separate component) and
  `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
  (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
    `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
    `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
    usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
    `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via
    `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
    rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
    `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
    one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
  not committed" convention (matches `examples/deployments/powerplanner/
  .tmp-gp-minimal.yaml`) and to block the four common root-level
  personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
  `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
  `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
  `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
  `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
  cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
  `kubectl wait --for=jsonpath='{.status.state}'=successful` form
  in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
    590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
    draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode`
    are pre-existing PlannerConfig fields (owned by
    `monitoring/aic_interpolation.py`) and intentionally absent from
    the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
    it correctly points at the *throughput regression* case rather
    than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with
    `pynvml.nvmlDeviceGetCount()` (the original was syntactically
    incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated
    `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
    `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
    §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the
    intentional D21/E21 ID collision (IDs unique by filename + tuple,
    not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven
    items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
    + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
  Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
  test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
  Windows-only environment limitations (cp1252 codec, `os.killpg`
  POSIX-only, missing `filterpy`) confirmed via `git stash` to be
  untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
  `tests/fault_tolerance/...` (pre-existing, separate component) and
  `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
  (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
    `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
    `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
    usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
    `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via
    `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
    rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
    `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
    one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
  not committed" convention (matches `examples/deployments/powerplanner/
  .tmp-gp-minimal.yaml`) and to block the four common root-level
  personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
  `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
  `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
  `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
  `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
  cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
  `kubectl wait --for=jsonpath='{.status.state}'=successful` form
  in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
    590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
    draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode`
    are pre-existing PlannerConfig fields (owned by
    `monitoring/aic_interpolation.py`) and intentionally absent from
    the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
    it correctly points at the *throughput regression* case rather
    than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with
    `pynvml.nvmlDeviceGetCount()` (the original was syntactically
    incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated
    `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
    `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
    §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the
    intentional D21/E21 ID collision (IDs unique by filename + tuple,
    not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven
    items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
    + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
  Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
  test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
  Windows-only environment limitations (cp1252 codec, `os.killpg`
  POSIX-only, missing `filterpy`) confirmed via `git stash` to be
  untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
  `tests/fault_tolerance/...` (pre-existing, separate component) and
  `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Pre-Phase-5 ("hardware validation" per powerplanner-design.md §11) housekeeping:
makes the dev environment shareable across teammates by removing personal
identifiers, parameterizing all dev-pod / probe references, and folding 11
review fixes into the three Phase 1-4 design documents.

Dev-env hardening
-----------------
* Personal namespace (`kaim-dynamo-system*`) and pinned cluster node ID
  (`aks-a100a-36888584-vmss000002`) removed from every dev-env asset:
  - Root-level `dev-pod.yaml`, `qwen3-quickstart-dgd.yaml`, and
    `Dockerfile.planner-dev` moved to `deploy/planner/dev/` with
    `${NS}` / `${DGD}` / `${DYN_NS}` envsubst placeholders and inline
    usage instructions.
  - Root-level `test_k8s_access.py` moved to `scripts/dev/` and reads
    `DYN_PARENT_DGD_K8S_NAMESPACE` (or `POD_NAMESPACE`) at runtime.
  - 5 `scripts/inspect_*.py` cluster probes parameterized via
    `DYN_PARENT_DGD_K8S_NAMESPACE`; failure mode is loud (SystemExit)
    rather than a hard-coded namespace.
  - `deploy/power_agent/dev-pod.yaml`: `nodeName` switched to
    `<GPU_NODE_NAME>` placeholder with a `kubectl get pods ...`
    one-liner showing how to discover the right node.
* `.gitignore` hardened to enforce the existing `.tmp-*` "intentionally
  not committed" convention (matches `examples/deployments/powerplanner/
  .tmp-gp-minimal.yaml`) and to block the four common root-level
  personal-scratch files from sneaking back in via `git add .`.

dpp-dev-env.md updates
----------------------
* All 10 path references rewritten to point at the new
  `deploy/planner/dev/` and `scripts/dev/` homes.
* New §5 ("Deploy the Dev Pod") subsection documenting the
  `${NS}` / `${DGD}` / `${DYN_NS}` placeholder workflow with both an
  `envsubst` (Linux/WSL) path and a Windows edit-in-place path.
* Quick Deploy Checklist filename corrected from the stale-DGDR
  `qwen3-quickstart.yaml` to `qwen3-quickstart-dgd.yaml`, plus a
  cross-reference to the One-Time Setup §4 warning.
* DGD-ready wait command standardized to the programmatic
  `kubectl wait --for=jsonpath='{.status.state}'=successful` form
  in both §4 and the Checklist (removes the manual-`-w` divergence).

Design-doc review pass (powerplanner-design.md, powerplanner-testbed-design.md)
-------------------------------------------------------------------------------
* powerplanner-design.md
  - Header `Status` flipped from `Draft` to `Validated (Phases 1-3 -
    590/4 cold + 86/1 testbed; see §9.0 / §9.0.1). Phases 4-5 still
    draft.` with last-validation date.
  - §3.1: added an explicit note that `aic_interpolation` and `mode`
    are pre-existing PlannerConfig fields (owned by
    `monitoring/aic_interpolation.py`) and intentionally absent from
    the Names Registry.
  - §5.7 / §6.7: reworded the cross-reference to failure mode #5 so
    it correctly points at the *throughput regression* case rather
    than reading as a blanket "config revert".
  - §6.5 pseudocode: replaced module-level `self.device_count` with
    `pynvml.nvmlDeviceGetCount()` (the original was syntactically
    incorrect outside the daemon class).
  - §13 Open Question #11: corrected the duplicated
    `scheduled_decode_kv_tokens` typo so the agg-mode gate now reads
    `scheduled_prefill_tokens + scheduled_decode_kv_tokens` matching
    §5.3.
* powerplanner-testbed-design.md
  - §7: added a "Numeric-suffix convention" note explaining the
    intentional D21/E21 ID collision (IDs unique by filename + tuple,
    not renumbered).
  - §11: corrected "Six guards" -> "Seven guards" to match the seven
    items actually listed.
  - §5.2: typo `MNB` -> `MNBT (max_num_batched_tokens)`.
  - §C.14: verbatim test-output `30 PASSED` -> `31 PASSED` (1 wrapper
    + 30 parametrized) so the listing matches the actual run.

Verification
------------
* Testbed (alpha + gamma): 82 passed, 5 skipped (matches the documented
  Windows baseline; gamma auto-skipped without the Rust mocker).
* AIC no-cluster integration (test_aic_power_optimizer.py +
  test_aic_power_e2e_sim.py): 49 passed.
* Unit tests: 456 passed; the 9 remaining failures are pre-existing
  Windows-only environment limitations (cp1252 codec, `os.killpg`
  POSIX-only, missing `filterpy`) confirmed via `git stash` to be
  untouched by this commit.
* All touched .py files: `python3.10 -m py_compile` clean.
* All touched .yaml files: `yaml.safe_load_all` clean.
* `ReadLints` over all 12 touched files: no errors.
* Final `rg "kaim|aks-a100a-36888584"` outside committed
  `tests/fault_tolerance/...` (pre-existing, separate component) and
  `examples/.../.tmp-gp-minimal.yaml` (now gitignored): zero hits.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
ryanolson added a commit that referenced this pull request May 20, 2026
…nly; extract executor helpers

#5: PreparedTransferPlan becomes a struct (Transform fields only); the
Direct variant + build_direct + annotated_layouts + the projection-memo
cache it backed are all removed. lookup_prepared_plan returns
Option<Arc<...>> with None for same-layout direct pairs; plan_and_lower
projects AnnotatedLayouts inline on that branch. The microsecond
projection-memo against millisecond-scale DMA/RDMA was never
measurable, and the Direct cache entry per axis_slice pattern was
churn for no gain.

#2 surgical: TransferManager::execute_transfer +
::execute_transfer_selection now share resolve_and_dissolve (handle
resolve + options dissolve + builder field threading) and
dispatch_resolved (executor call + metric wrap). Each public method
shrinks from ~80 lines to ~10. execute_planner_cuda_transfer +
execute_planner_nixl_transfer share a new planner_prelude helper
covering heterogeneous-reject + benchmark-lookup + prepared-lookup +
plan_and_lower — only the family validator + policy + post-outcome
dispatch differ per entry.

Net ~-220 lines. Verified: 743/743 lib tests pass; incompat-smoke.sh
6/6 PASS; cargo fmt + clippy clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
krishung5 added a commit that referenced this pull request Jun 2, 2026
- decode_handler._extract_mm_hashes docstring now states 16-char hex
  for the SGLang path (was incorrectly claiming the vLLM 64-char shape;
  Devin #5).
- sglang launch banner drops the "Lightseek" prefix; uses "MM Exact
  Routing (SGLang)" to match the public name (Ryan suggestion #1).
- ModelRuntimeConfig.backend_framework field gains a doc-block on its
  motivation — frontend uses it for backend-specific routing hints
  (Ryan #3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
…eview ai-dynamo#5)

TrafficMetrics.{duration_s,num_req,isl,osl,kv_hit_rate} and
PredictionData.{predicted_num_req,predicted_isl,predicted_osl,
predicted_kv_hit_rate} were proto ``float`` (IEEE-754 32-bit), but their
source of truth (core/types.py + the Pydantic mirror) is Python float64.
Every out-of-process plugin received these values truncated to float32
(and re-truncated on the way back), so rate-style values drifted ~1e-4
and the gRPC transport disagreed bit-for-bit with the in-process
transport — which is presented as interchangeable. The round-trip tests
passed only because every fixture float was hand-picked float32-exact.

Change the nine fields to ``double`` / ``optional double``. Safe now:
the v1 contract is pre-ship (no external plugin has shipped against it),
and changing the wire type before release avoids a breaking change later.
Regenerated *_pb2 stubs (+ SPDX re-prepend). The Pydantic mirror already
uses float64, so no mirror change. New test asserts EXACT (not approx)
round-trip for non-float32-exact values, so a regression to ``float`` is
caught.

838 planner tests pass (+1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
…I runtime parity

The review-ai-dynamo#5 regen (a2b8c1c) was run with a local grpcio-tools 1.80
/ protobuf 6.33, baking ValidateProtobufRuntimeVersion(6,31,1) and
GRPC_GENERATED_VERSION '1.80.0' into the committed stubs. CI's pinned
runtime is protobuf 5.29.6 (requirements: protobuf>=5.29.5,<6.0dev),
so the 6.31.1 gencode guard hard-raised VersionError at import, erroring
out 5 planner test modules at collection.

Regenerate from the unchanged plugin.proto with grpcio-tools 1.67.1 —
the same toolchain the stubs originally shipped with (b39dde8) and
which passed CI — restoring gencode 5.27.2 + grpc guard 1.67.1. The
serialized descriptor blob is byte-identical (schema unchanged: doubles,
kv_hit_rate, WorkerState flags, component_name strip all preserved);
only the version guards revert to CI-compatible values.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
components/power_agent/actuator.py:807 uses dcgm_structs.c_dcgmDeviceConfig_v2,

a struct that only exists in DCGM 4.x. With 3.x bindings the agent's

DcgmActuator.init() succeeded, opened the hostengine connection, and

ran NVML init cleanly — then crashed mid-first-reconcile with

'AttributeError: module dcgm_structs has no attribute

c_dcgmDeviceConfig_v2' after some GPUs were already capped. The

SIGTERM-restore path won't run when the actuator never finished

registering with _active_actuator, so the GPUs are left at custom caps.

Add a 7-line hasattr check immediately after the dcgm_structs import

in init() that raises a RuntimeError with:

  - the missing struct name (so the error is grep-able)

  - the required DCGM major (>=4.0)

  - the DCGM_IMAGE build-arg + canonical 4.5.1 tag (so operators

    know exactly what to bump)

Guards against accidental Dockerfile regressions to a 3.x base image.

Pair with the Dockerfile fix in 4820ca7 (which bumps the default

to 4.5.1-1-ubuntu22.04) so the default build path is consistent with

the runtime contract.

Unit test covers both happy path (DCGM 4.x mock) and 3.x rejection

(MagicMock with c_dcgmDeviceConfig_v2 deleted), asserting init fails

BEFORE the first pydcgm.DcgmHandle call so misconfigured deployments

don't half-init and leave hostengine sockets dangling.

Refs: PR #9790 review, Power Agent live-test finding #5.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants