Skip to content

feat: nixl metadata store and retrieved from etcd#6

Merged
nnshah1 merged 110 commits into
mainfrom
nnshah1-vllm-nixl-etcd
Mar 4, 2025
Merged

feat: nixl metadata store and retrieved from etcd#6
nnshah1 merged 110 commits into
mainfrom
nnshah1-vllm-nixl-etcd

Conversation

@nnshah1

@nnshah1 nnshah1 commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

What does the PR do?

  • Adds ability to store and retrieve nixl metadata into etcd
  • Adds python bindings for runtime etcd client

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@grahamking

Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

@nnshah1

nnshah1 commented Mar 4, 2025

Copy link
Copy Markdown
Contributor Author

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

@ishandhanani

Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

I think Ryan in the previous repo wanted to use the DistributedRuntime

@grahamking

Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

@nnshah1

nnshah1 commented Mar 4, 2025

Copy link
Copy Markdown
Contributor Author

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

@nnshah1

nnshah1 commented Mar 4, 2025

Copy link
Copy Markdown
Contributor Author

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

initially we used the plain python library - but direction was to reuse the rust implementation - agree - we will want to move it to already match with the lease for the component - and have the namespace applied automatically - so it'll be more of a KVStore in the namespace then an etcd client ...

@grahamking

grahamking commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

You mean like nim-nvllm has had since mid January? (update internal URL)

You could choose to do discovery with etcd or nats. --net-kv [nats|etc|<also urls>].

Ryan O dropped it in his furious weekend re-write that became the initial commit to triton distributed. I'm stuck in a time loop.

@grahamking

Copy link
Copy Markdown
Contributor

We should maybe warn people not to rely on the etcd bindings if we're going to put them behind an abstraction eventually.

@nnshah1 nnshah1 requested a review from statiraju March 4, 2025 20:47
@nnshah1 nnshah1 merged commit 57d19b5 into main Mar 4, 2025
@nnshah1 nnshah1 deleted the nnshah1-vllm-nixl-etcd branch March 4, 2025 21:10
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com>
Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com>
Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 22, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 25, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
kangclzjc added a commit to kangclzjc/dynamo that referenced this pull request Jun 4, 2026
…ay (review ai-dynamo#6)

The registration gateway receives external plugins' shared-secret
``RegisterRequest.auth_token``. ``start_gateway_server`` bound
``add_insecure_port`` unconditionally when no TLS creds were supplied,
and the production caller never supplies creds and had no config to —
so pointing ``gateway.listen`` at a TCP ``host:port`` silently stood up
a plaintext gRPC server that received every plugin's token in cleartext,
with only an INFO log. This is asymmetric with the OUTBOUND transport,
which fails closed unless ``transport.allow_insecure_grpc=True``.

Make the inbound side symmetric:
- Add ``GatewayConfig.allow_insecure`` (default False).
- ``start_gateway_server`` gains ``allow_insecure`` and, in the
  no-credentials branch, refuses to bind a non-``unix:`` (TCP) listen
  unless ``allow_insecure`` is set — raising a clear RuntimeError before
  any bind. ``unix:`` (Pod-local, trust-boundary) listens are always
  allowed. When a plaintext TCP bind IS opted into, it logs a WARNING
  (not INFO) naming the token-exposure risk.
- ``_maybe_start_gateway`` passes ``gw_cfg.allow_insecure`` through.

Tests: TCP + allow_insecure=False → RuntimeError "refusing to bind
plaintext"; TCP + allow_insecure=True → binds (stubbed server).

837 planner tests pass (+2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request Jun 8, 2026
Three Dockerfile bugs combined to make the DCGM-mode image unbuildable

on a fresh checkout. Fixing any one in isolation leaves the build broken,

so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04'

   does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect'

   → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.

2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/

   to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous

   COPY would silently copy zero files under the new pin. Switch the source

   path to the 4.5+ location.

3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing

   'import logger' — but logger.py lives in DCGM's source tree under

   testing/python3/ and is NOT packaged. Without a shim every DcgmGroup

   construction raises ModuleNotFoundError. Add a 10-line stdlib-logging

   adapter at components/power_agent/logger.py and COPY it into

   /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a

fresh clone (verified locally via 'docker buildx build --build-arg

DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21,

image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the

Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.
Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants