feat: nixl metadata store and retrieved from etcd by nnshah1 · Pull Request #6 · ai-dynamo/dynamo

nnshah1 · 2025-03-04T17:05:10Z

What does the PR do?

Adds the ability to store nixl metadata in etcd and retrieve it
Adds python bindings for runtime etcd client

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

Where should the reviewer start?

Test plan:

CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

grahamking · 2025-03-04T20:16:56Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

nnshah1 · 2025-03-04T20:23:18Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

ishandhanani · 2025-03-04T20:24:08Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

I think Ryan in the previous repo wanted to use the DistributedRuntime

grahamking · 2025-03-04T20:24:39Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

nnshah1 · 2025-03-04T20:25:12Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

nnshah1 · 2025-03-04T20:26:50Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

Initially we used the plain Python library, but the direction was to reuse the Rust implementation. Agreed: we will want to tie it to the lease for the component and have the namespace applied automatically, so it will be more of a KVStore in the namespace than an etcd client ...
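A minimal sketch of the namespace-scoped store described above, to make the "namespace applied automatically" idea concrete. Everything here is an assumption for illustration: `KvStore`, `InMemoryBackend`, and the key layout are hypothetical names, and the in-memory dictionary stands in for the real etcd client so the example runs without a server. It does not model leases.

```python
from typing import Dict, Optional


class InMemoryBackend:
    """Hypothetical stand-in for an etcd client; stores bytes under string keys."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class KvStore:
    """Hypothetical namespace-scoped view: every key is prefixed automatically."""

    def __init__(self, backend: InMemoryBackend, namespace: str) -> None:
        self._backend = backend
        self._prefix = namespace.strip("/") + "/"

    def put(self, key: str, value: bytes) -> None:
        self._backend.put(self._prefix + key, value)

    def get(self, key: str) -> Optional[bytes]:
        return self._backend.get(self._prefix + key)


# Two namespaces share one backend but cannot see each other's keys.
backend = InMemoryBackend()
store_a = KvStore(backend, "ns-a")
store_b = KvStore(backend, "ns-b")
store_a.put("nixl_metadata/engine-0", b"opaque-bytes")
```

Callers only ever see the short key; the backend detail (etcd or otherwise) and the prefixing stay behind the wrapper.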

grahamking · 2025-03-04T20:31:59Z

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

You mean like nim-nvllm has had since mid January? (update internal URL)

You could choose to do discovery with etcd or nats. --net-kv [nats|etc|<also urls>].

Ryan O dropped it in his furious weekend re-write that became the initial commit to triton distributed. I'm stuck in a time loop.

grahamking · 2025-03-04T20:34:50Z

We should maybe warn people not to rely on the etcd bindings if we're going to put them behind an abstraction eventually.

Co-authored-by: hongkuanz <hongkuanz@nvidia.com> Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com> Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local> Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com> Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com> Co-authored-by: ishandhanani <ishandhanani@gmail.com> Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com> Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>

Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred; complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rrubin <rrubin@nvidia.com>
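The last design item (#12) swaps a silent `shutil.rmtree(ignore_errors=True)` for a try/except that logs. A minimal sketch of that pattern follows; the helper name `remove_tree_logged` and the log wording are assumptions for illustration, not the code from the commit.

```python
import logging
import shutil
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def remove_tree_logged(path: Path) -> bool:
    """Remove a directory tree; log a warning and return False on failure."""
    try:
        shutil.rmtree(path)
    except OSError as exc:
        # Without ignore_errors=True, rmtree raises, so the failure is visible here.
        logger.warning("cleanup of %s failed: %s", path, exc)
        return False
    return True


scratch = Path(tempfile.mkdtemp())
first = remove_tree_logged(scratch)   # directory exists, removal succeeds
second = remove_tree_logged(scratch)  # directory is already gone, failure is logged
```

Returning a boolean is one possible design; a fixture could equally just log and continue.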

Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2: subprocess test verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode: verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: rrubin <rrubin@nvidia.com>

…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
1. Title + 3-config table
2. Pre-requisites (was #3)
3. Quick start (was #8, promoted)
4. Directory layout (was #7, now serves as map for the rest)
5. Hardware targets (was #2, now pure reference; invocation examples moved into Quick start)
6. Storage (was #5)
7. aiperf install (was #6)
8. Naming & ownership (was #4)
9. Notes (unchanged)

Also drop the stray "We use ebs by default" sentence: it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three Dockerfile bugs combined to make the DCGM-mode image unbuildable on a fresh checkout. Fixing any one in isolation leaves the build broken, so they travel together:

1. DCGM_IMAGE default 'nvcr.io/nvidia/cloud-native/dcgm:4.2.3-2-ubuntu22.04' does not exist on NGC (verified 2026-05-21 via 'docker manifest inspect' → 404). Bump to 4.5.1-1-ubuntu22.04, the only resolvable 4.x tag.
2. DCGM 4.5+ relocated python bindings from /usr/local/dcgm/bindings/python3/ to /usr/share/datacenter-gpu-manager-4/bindings/python3/. The previous COPY would silently copy zero files under the new pin. Switch the source path to the 4.5+ location.
3. NGC's DCGM 4.5+ runtime image ships pydcgm with DcgmGroup.py:20 doing 'import logger', but logger.py lives in DCGM's source tree under testing/python3/ and is NOT packaged. Without a shim every DcgmGroup construction raises ModuleNotFoundError. Add a 10-line stdlib-logging adapter at components/power_agent/logger.py and COPY it into /opt/dcgm/python/logger.py during the runtime stage.

This unblocks 'docker build -f components/power_agent/Dockerfile' on a fresh clone (verified locally via 'docker buildx build --build-arg DCGM_IMAGE=...4.5.1-1-ubuntu22.04' against viking-prod-216 on 2026-05-21, image pushed to ttl.sh/dynamo-pa-kaim-dcgm45-v2:24h and used by the Path-B live test on aks-a100b-22138447-vmss000000).

Refs: PR #9790 review, Power Agent live-test findings #1/#2/#6.

Signed-off-by: Kai Ma <kaim@nvidia.com>

…ay (review ai-dynamo#6)

The registration gateway receives external plugins' shared-secret ``RegisterRequest.auth_token``. ``start_gateway_server`` bound ``add_insecure_port`` unconditionally when no TLS creds were supplied, and the production caller never supplies creds and had no config to, so pointing ``gateway.listen`` at a TCP ``host:port`` silently stood up a plaintext gRPC server that received every plugin's token in cleartext, with only an INFO log. This is asymmetric with the OUTBOUND transport, which fails closed unless ``transport.allow_insecure_grpc=True``.

Make the inbound side symmetric:
- Add ``GatewayConfig.allow_insecure`` (default False).
- ``start_gateway_server`` gains ``allow_insecure`` and, in the no-credentials branch, refuses to bind a non-``unix:`` (TCP) listen unless ``allow_insecure`` is set, raising a clear RuntimeError before any bind. ``unix:`` (Pod-local, trust-boundary) listens are always allowed. When a plaintext TCP bind IS opted into, it logs a WARNING (not INFO) naming the token-exposure risk.
- ``_maybe_start_gateway`` passes ``gw_cfg.allow_insecure`` through.

Tests: TCP + allow_insecure=False → RuntimeError "refusing to bind plaintext"; TCP + allow_insecure=True → binds (stubbed server). 837 planner tests pass (+2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
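The fail-closed rule described in this commit message can be sketched as a small decision function. This is an illustration under assumptions, not the actual ``start_gateway_server`` code: the function name `check_listen_allowed`, its return labels, and the exact log and error wording beyond "refusing to bind plaintext" are hypothetical, and no gRPC server is created.

```python
import logging

logger = logging.getLogger(__name__)


def check_listen_allowed(listen: str, has_tls_creds: bool, allow_insecure: bool) -> str:
    """Decide how a listen address may be bound; raise before any bind otherwise."""
    if has_tls_creds:
        return "secure"
    if listen.startswith("unix:"):
        # Pod-local socket: always allowed without credentials.
        return "unix"
    if not allow_insecure:
        raise RuntimeError(
            f"refusing to bind plaintext gRPC on {listen!r}; "
            "set allow_insecure to opt in"
        )
    # Explicit opt-in: warn loudly, since tokens would travel in cleartext.
    logger.warning("binding plaintext gRPC on %s; auth tokens are exposed", listen)
    return "insecure-tcp"
```

For example, `check_listen_allowed("unix:/run/gw.sock", False, False)` is accepted, while a TCP address with the same arguments raises before anything is bound.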

tedzhouhk and others added 30 commits February 21, 2025 21:56

init example

fbc3651

add nixl to dockerfile.vllm

9299991

add nixl torch example

7206489

wip vllm with nixl

d7c607d

first working nixl conditional prefill

80dfe9e

add readme

ef012e0

use callback for remote prefill req

3094654

wip tp > 1

6078cdf

add nixl metadta struct

88f3d87

update readme with tp > 1

6f7e4ce

decode run with MQLLMEngine

be6588d

decode on triton

21f33ca

triton dummy prefill

00ffc97

triton prefill

41a16a2

update readme

cda0077

remove nixl torch example

1971de5

update todos

6eb260d

update todos

85ef7f5

add http endpoint

e8434f6

remove remote prefill response

81ab078

update todos

899cc75

exchange metadta over fs

be2682d

do not restrict mem

c0e2357

update readme

3b42c68

init example

d80e1d6

add nixl to dockerfile.vllm

fb357ca

add nixl torch example

e51ff32

wip vllm with nixl

d6fba17

first working nixl conditional prefill

f959b01

add readme

0ff5e41

ptarasiewiczNV approved these changes Mar 4, 2025

grahamking approved these changes Mar 4, 2025

nnshah1 requested a review from statiraju March 4, 2025 20:47

nnshah1 merged commit 57d19b5 into main Mar 4, 2025

nnshah1 deleted the nnshah1-vllm-nixl-etcd branch March 4, 2025 21:10

dagil-nvidia mentioned this pull request Oct 21, 2025

feat: Add automated dependency version tracking and extraction #3547

Closed

coderabbitai Bot mentioned this pull request Oct 22, 2025

docs: address Harry/VDR feedback + fixing broken links across repository #3802

Merged

coderabbitai Bot mentioned this pull request Nov 5, 2025

feat: Update Dynamo k8s deployment example to use ModelExpress #4112

Closed

nv-tusharma mentioned this pull request Feb 12, 2026

fix: add trtllm build dependency for fault tolerance tests #6140

Closed

MatejKosec mentioned this pull request Feb 25, 2026

fix(responses): accept assistant output_text messages without id/status in input #6599

Merged

tanmayv25 mentioned this pull request Apr 15, 2026

DEP: Backend Interface -- LLMEngine ABC and Worker #8251

Open

atchernych mentioned this pull request Apr 23, 2026

Proposal: Hybrid Priority Scheduling in Dynamo #8580

Open

keivenchang mentioned this pull request Apr 24, 2026

[INVESTIGATE] sglang + DSv4: kwarg workaround breaks live — needs diagnosis #8671

Closed

dmitry-tokarev-nv mentioned this pull request May 26, 2026

fix(trtllm): restore deps, default CMD, and reduce layer count after upstream base switch #9889

Merged

nnshah1 merged 110 commits into main from nnshah1-vllm-nixl-etcd

5 participants
