Skip to content

ci: update container registry location#3

Merged
saturley-hall merged 16 commits into
mainfrom
harrison/ci-test
Mar 4, 2025
Merged

ci: update container registry location#3
saturley-hall merged 16 commits into
mainfrom
harrison/ci-test

Conversation

@saturley-hall

@saturley-hall saturley-hall commented Mar 4, 2025

Copy link
Copy Markdown
Member

What does the PR do?

Updates the container registry used for caching

Splits CodeQL and Copyright-Checks into separate workflows.

Update image that Copyright checks uses to a new location in GHCR.io

@github-actions

github-actions Bot commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

Test Results

 2 files   2 suites   25s ⏱️
71 tests 71 ✅ 0 💤 0 ❌
89 runs  88 ✅ 1 💤 0 ❌

Results for commit c7118b9.

♻️ This comment has been updated with latest results.

tedzhouhk added a commit that referenced this pull request Apr 18, 2026
Adds TestQwen235BugReproCase with the exact two picks from the original
triage report:

* Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4
* Decode: tp=1, dp=8, moe_tp=2, moe_ep=4

Each test pins down a specific symptom that the old profiler path
produced:

* test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs)
  and Bug #2 (using .tp_size property) would break the identity for
  this pick. Verifies all five kwargs and the MoE identity.
* test_decode_pick_satisfies_aic_identity — same for the decode pick,
  which was the exact shape that triggered Bug #3 (missing
  attention_dp_size).
* test_tp_size_property_differs_from_aic_tp_size — direct regression
  guard: documents that the property returns 1 here while AIC's tp_size
  is 4. Anyone who wires .tp_size back into AIC kwargs will break this.
* test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both
  sweeps and asserts every AIC call gets complete kwargs. This is the
  direct reproducer of the originally-reported failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 4, 2026
…code-waiter cleanup

Addresses two correctness issues raised by dynamo-ops review on PR 8965.

1. **Race on PrefillReady re-insert** (model_manager.rs)

   Both `register_prefill_router` and `activate_prefill_router` previously
   used a `remove → process → insert` pattern on
   `prefill_router_activators`. Between the `remove` and the `insert` the
   entry is gone from the map, so a concurrent `remove_prefill_activator`
   call (called by the watcher on prefill-component teardown) hits the
   empty map, skips its cleanup, and we then re-insert a stale
   `PrefillReady` for a prefill that's already gone. The next decode
   rebuild's lookup finds the resurrected cache and activates a
   `PrefillRouter` against a dead endpoint.

   Refactor both functions to use DashMap's `Entry` API. The shard lock is
   held for the duration of the OccupiedEntry, so any concurrent
   `remove_prefill_activator` serializes after us and observes the entry
   it needs to clear. The PrefillReady-consume path in
   `register_prefill_router` now reads-and-clones in place without
   removing; the DecodeWaiting/PrefillReady → PrefillReady transition in
   `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an
   atomic value swap that returns the old value.

2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs)

   `handle_delete_helper`'s decode branch only called
   `remove_decode_prefill_waiter` when `remove_worker_set` returned
   `Some(_)`. If decode registered (creating a `DecodeWaiting` activator
   entry via `register_prefill_router`) but `handle_add_helper` failed
   later (e.g., on `kv_chooser_for`, monitor setup, or
   `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever
   landed in the manager — and on later teardown
   `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting`
   entry orphaned in the activator map.

   Drop the `removed.is_some()` guard. The helper itself is state-safe
   (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so
   calling it on a key that's vacant or holds `PrefillReady` is a no-op.
   The prefill-teardown branch keeps its `removed.is_some()` guard
   because `remove_prefill_activator` blanket-removes both states, and
   calling it without a real WorkerSet teardown could orphan a live
   decode-side `DecodeWaiting`.

Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary
OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting,
ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
tedzhouhk added a commit that referenced this pull request May 11, 2026
…rop shebang

Three CI failures on PR #9360 addressed:

1. ``check-shebang-scripts-are-executable`` (pre-commit):
   ``tests/vllm/test_self_benchmark_gpu.py`` carried a ``#!/usr/bin/env
   python3`` shebang but isn't intended to be directly executed (pytest
   collects it). Drop the shebang to satisfy the hook without granting
   exec bits.

2. ``pytest-marker-report`` (pre-commit) -- 9 collection errors:
   ``tests/vllm/__init__.py`` makes ``tests/vllm`` a Python package
   that shadows the real ``vllm`` distribution when ``tests`` is on
   sys.path during marker discovery. Every other vLLM-suite test that
   does ``from vllm import LLM`` (or similar) broke with
   ``ImportError: cannot import name 'LLM' from 'vllm'`` pointing at
   our ``tests/vllm/__init__.py``. Renamed to
   ``tests/vllm_self_benchmark/`` to remove the name collision; all 9
   shadowed imports resolve to the real ``vllm`` again.

3. ``vllm-runtime / Multi-GPU Test`` -- disagg test failed with:
       (EngineCore pid=...) zmq.error.ZMQError: Address already in use
                                  (addr='tcp://*:20380')
   ``InstrumentedScheduler`` (auto-injected by ``--benchmark-mode``)
   binds the FPM publisher to ``tcp://*:DYN_FORWARDPASS_METRIC_PORT +
   dp_rank``, defaulting to 20380. Two workers on the same host (the
   disagg prefill + decode pair) both try to bind 20380 and the second
   loses. The operator works around this by injecting
   ``DYN_FORWARDPASS_METRIC_PORT`` per engine
   (deploy/operator/internal/dynamo/component_worker.go:111); the
   single-GPU agg test happens to avoid the collision because only
   one worker runs. Apply the same fix in the test: allocate a
   per-worker port via ``allocate_port(20380)`` and inject it as the
   env var, with matching ``deallocate_port`` on teardown.

   (The cancellation/migration tests don't hit this because they
   don't pass ``--benchmark-mode`` and so don't get
   InstrumentedScheduler.)

Verified locally:
* ``import vllm`` resolves to the real site-packages vllm again.
* pytest collection of the renamed test passes; marker selectors
  ``pre_merge and vllm and gpu_1`` (agg test) and
  ``pre_merge and vllm and (gpu_2 or gpu_4)`` (disagg test) still
  match -- so both stay wired to ``pr.yaml``'s vllm-test and
  vllm-multi-gpu-test GPU stages and continue to gate the PR.

The single-GPU agg test passed in the prior CI run (only the disagg
test failed) so issue #3 is specifically what was needed for disagg
to pass.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
krishung5 added a commit that referenced this pull request Jun 2, 2026
- decode_handler._extract_mm_hashes docstring now states 16-char hex
  for the SGLang path (was incorrectly claiming the vLLM 64-char shape;
  Devin #5).
- sglang launch banner drops the "Lightseek" prefix; uses "MM Exact
  Routing (SGLang)" to match the public name (Ryan suggestion #1).
- ModelRuntimeConfig.backend_framework field gains a doc-block on its
  motivation — frontend uses it for backend-specific routing hints
  (Ryan #3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request Jun 3, 2026
When rbac.namespaceRestricted=true the chart renders a namespace-scoped
Role + RoleBinding (templates/role.yaml + rolebinding.yaml) instead of
the cluster-scoped pair, but the DaemonSet's container command never
passes --namespace to the agent. power_agent.py:438-446 therefore takes
its else branch — list_pod_for_all_namespaces — which the namespace-only
token rejects with 403 Forbidden on every reconcile. Reconciles never
advance and no caps are ever applied (silent failure: the agent doesn't
exit; pods are just never enumerated).

The --namespace CLI flag itself already exists (power_agent.py:542) and
the list_namespaced_pod branch (:439-442) works correctly — the dev-pod
template (templates/dev-pod.yaml:57) has been wiring this all along via
--namespace=$POD_NAMESPACE. This commit extends the same downward-API
POD_NAMESPACE env-var pattern to the production DaemonSet, gated on the
existing power-agent.effectiveNamespaceRestricted helper so both:

  - production mode with rbac.namespaceRestricted=true
  - dev mode (forces effective=true unless overridden)

produce a DaemonSet whose argv matches the RBAC scope its token holds.

Verified by helm template against both modes:
  - default (false) → no --namespace, no POD_NAMESPACE, ClusterRole
  - true            → --namespace=$(POD_NAMESPACE), POD_NAMESPACE env
                      via downward API, Role + RoleBinding
  - helm lint passes in both modes

Refs: PR #9682 review, Power Agent live-test finding #3.
Signed-off-by: Kai Ma <kaim@nvidia.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kangclzjc referenced this pull request in kangclzjc/dynamo Jun 4, 2026
… path (review #3)

On the orchestrator replay path (use_orchestrator=True) the regression
models were never installed: replay/main.py guarded the AIC benchmark-FPM
bootstrap behind ``adapter._sm is not None and not adapter._sm._is_easy``,
and ``_sm`` is None under use_orchestrator — so the whole block (incl. the
``load_benchmark_fpms`` calls) was skipped. ``get_regression`` then returned
None for the whole replay, the throughput regression stayed empty, and
orchestrator-replay scaling decisions diverged from PSM — contradicting the
adapter docstrings that claimed bootstrap_from_fpms→install_regressions was
wired.

Fix:
- engine_adapter: extract ``install_regressions_from_fpms`` (synchronous,
  builds the regressions from benchmark FPMs via the throwaway-PSM factory
  and installs them on the shared store — no plugin bootstrap).
  ``bootstrap_from_fpms`` now = install_regressions_from_fpms +
  bootstrap_plugins.
- replay_adapter: add path-agnostic ``install_benchmark_fpms`` — PSM path →
  ``PlannerStateMachine.load_benchmark_fpms``; orchestrator path →
  ``install_regressions_from_fpms`` (plugins were already bootstrapped at
  adapter construction, so this does NOT double-bootstrap). Corrected the
  ``_get_regression`` docstring to the real wiring.
- replay/main.py: guard is now ``not adapter._is_easy_mode()`` (path-
  agnostic) and the two ``adapter._sm.load_benchmark_fpms(...)`` calls become
  ``adapter.install_benchmark_fpms(...)``.

Test: install_benchmark_fpms on the orchestrator path makes
get_regression("agg") non-None (was None pre-fix).

840 planner tests pass (+1). NOTE: replay/main.py's end-to-end path can't be
exercised in this env (its Rust _core symbol predates the installed
extension); the edit is py_compile-validated and the adapter method it calls
is unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jthomson04 added a commit that referenced this pull request Jun 4, 2026
The cache-realloc (#1) and metric-handle (#2/#3) commits left a few lines
unformatted — `longest_prefix_match`'s collapsed signature and the `merged`
binding in l1.rs, and the per-worker gauge assignment + a test assert in
metrics.rs. `cargo fmt --all --check` (run by the rust-tests CI job) flagged
them, failing the job. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
zhongdaor-nv added a commit that referenced this pull request Jun 8, 2026
During each cell's measurement window, take a few lightweight per-node
snapshots (background thread, foreground sub.wait() unperturbed) capturing the
event-plane's resource footprint so we can quantify the brokerless-ZMQ vs NATS
cost trade-off:

  - established TCP count via `ss -tn state established` (ZMQ direct mesh is
    O(p*s); NATS is O(p+s) to the broker) -- the headline connection-cost diff.
  - per-process CPU% / RSS / open-fd / nproc for the key classes
    (mocker, event_plane_bench_sub, nats-server, frontend, zmq_broker, loadgen)
    via /proc/<pid>/{stat,status} + ls /proc/<pid>/fd.
  - nats-server broker cost: active during NATS cells vs ~idle during ZMQ
    cells (ZMQ has no broker -- the saving to quantify).

New ressample.py is a stdlib-only host-level sampler (runs on the node HOST,
not inside pyxis, so its ss/`/proc` scan sees every enroot process + all node
TCP). bench.py fires it on every node via a new launcher.popen_host() (srun
with no --container-* flags), collects the echoed JSON snapshots, folds them
into compact summary stats, writes a per-cell *_res.json, and attaches the
summary to every summary.json row. Gated behind DYN_BENCH_RESSAMPLE (default
on); snapshot count via DYN_BENCH_RESSAMPLE_N (default 2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zhongdaor-nv added a commit that referenced this pull request Jun 8, 2026
The sampler snapshots are read back from the srun-streamed stdout logs, not a
follow-up `cat` srun; correct the docstring to match the implementation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zhongdaor-nv added a commit that referenced this pull request Jun 8, 2026
#3)

Three fixes found in the lego-c2 sanity run:

1. RSS read 0 for in-container processes. Viewed from the node host, enroot/
   pyxis processes report VmRSS=0 in /proc/<pid>/status (and 0 resident in
   statm); only /proc/<pid>/smaps_rollup `Rss:` reports the true resident set.
   Read smaps_rollup first, fall back to VmRSS for host procs.

2. nproc inflation. The `srun` client + slurmstepd that launch nats-server /
   event_plane_bench_sub carry the full `-- nats-server ...` command in their
   argv, so the cmdline-substring match counted them too (nats nproc=3 instead
   of 1, with 0 RSS dragging the sum). Skip launcher/wrapper cmdlines
   (srun/slurmstepd/bash -c/` -- `/the sampler itself) so each class counts
   only the real target process.

3. Worker-node samplers crashed with FileNotFoundError: the node0-created
   LOGDIR doesn't exist on worker nodes, so `open(--out)` died before the
   snapshot was emitted -> only node0 reported, losing the cross-node
   connection counts. Now echo the JSON to stdout FIRST (the orchestrator
   reads snapshots from the srun-streamed stdout, not the --out file), then
   makedirs + best-effort write the node-local --out copy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants