Skip to content

feat: Worker self-hosted metadata files#8754

Draft
nnshah1 wants to merge 42 commits into
mainfrom
nnshah1/gh-8749-self-hosted-metadata
Draft

feat: Worker self-hosted metadata files#8754
nnshah1 wants to merge 42 commits into
mainfrom
nnshah1/gh-8749-self-hosted-metadata

Conversation

@nnshah1

@nnshah1 nnshah1 commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Implements #8749.

What this enables

A worker can opt in to self-hosting the metadata files (config.json, tokenizer.json / tokenizer.model, tokenizer_config.json, chat_template.{jinja,json}, generation_config.json) the frontend needs for preprocessing. The MDC's existing typed-enum CheckedFile slots become URI carriers — every slot already had a blake3 checksum and a path: Either<PathBuf, Url> capable of holding any URI scheme. The only new field is an additive CheckedFile.size: Option<u64> that doubles as the per-file fetch cap and the new/legacy discriminant.

Self-host is on by default. When the worker has a system_status_server running (i.e. DYN_SYSTEM_PORT is set), each metadata file's CheckedFile.path is rewritten to the worker's /v1/metadata/... route. Workers without a status server (no DYN_SYSTEM_PORT) gracefully skip the rewrite and keep whatever URI is naturally there (hf:// after origin's move_to_url, or a local file://) — frontends still resolve those through the same verify-and-cache pipeline.

Workers opt out via:

  • register_model(..., self_host_metadata=False) (Python kwarg), or
  • DYN_SELF_HOST_METADATA=false (or 0 / no / off) in the worker's environment.

State machine — worker side

  1. Worker process starts. DistributedRuntime::new spawns the system_status_server when DYN_SYSTEM_PORT is set. The server unconditionally mounts GET /v1/metadata/{model_slug}/{*filename} — the handler reads from a process-local MetadataArtifactRegistry on the runtime; an empty registry returns 404.
  2. Worker downloads / mounts the model. Existing path. Files end up either on local disk (HF cache, PVC mount, etc.) or are referenced via URL.
  3. Worker calls register_model(...). LocalModelBuilder::build() produces an MDC. Each populated typed-enum slot's CheckedFile is built via CheckedFile::from_disk(...), which mmap-hashes the bytes and populates (checksum, size) in one pass. Every populated slot now carries size = Some(N), so is_new_format() returns true.
  4. LocalModel::attach() fires.
    • If self_host_metadata=True (default) AND system_status_server is running: move_to_self_host(base, drt) walks every populated typed-enum slot via iter_metadata_files_mut(), derives the filename from the existing CheckedFile.path, and calls cf.move_to_url(<base>/v1/metadata/<slug>/<filename>) — the same helper origin already uses for hf:// rewriting. It also registers (slug, filename) → on-disk path in drt.metadata_artifacts(). The worker does not stage or copy bytes; files stay where they are.
    • If self_host_metadata=True (default) but DYN_SYSTEM_PORT is unset (no system_status_server): the rewrite is silently skipped with an info log; CheckedFiles continue with whatever URI is naturally there (hf:// after the fall-through below, or local file://).
    • For HF-sourced models where source_path doesn't exist locally, origin's existing move_to_url("hf://...") rewrites the typed CheckedFiles to hf://<repo>/<filename>. Unchanged from origin.
  5. MDC published into discovery. Discovery is keyed by mdcsum() which hashes only the typed-enum checksums, never URIs — multiple replicas of the same model share a workerset regardless of which URI scheme each worker advertises.

State machine — frontend side

  1. Frontend's ModelWatcher observes a new MDC.
  2. Workerset compatibility check. If the model already has a workerset, is_checksum_compatible(ws_key, mdcsum) rejects mismatched workers (loud error in logs, no registration). Same blake3s for the typed artifacts → same mdcsum() → same workerset.
  3. download_config runs. Order: (a) if is_new_format() (every populated slot has size = Some), route through resolve_metadata_files — every scheme rechecks blake3 against the staged bytes. (b) Otherwise legacy: short-circuit on has_local_files() for shared mounts. (c) Otherwise fall back to hub::from_hf(self.source_path()). The new format always goes through verify-and-cache, regardless of whether the URI is http://, hf://, or a synthesized file://.
  4. resolve_metadata_files resolves each populated slot. For each CheckedFile:
    • Derive the URI: cf.url() if the worker rewrote it via move_to_url/move_to_self_host; otherwise synthesize file://<canonicalized> from the local PathBuf (shared-mount case).
    • Compute blob = ~/.cache/dynamo/mdc/blobs/<blake3-hex>.
    • Acquire flock(LOCK_EX) on <blob>.lock. Linux's per-open-file-description lock semantics serialize both intra-process tasks and across processes, so a single layer covers both. Multiple frontends sharing $HOME collapse concurrent fetches to a single download per blake3.
    • Double-check blob.exists() after acquiring the lock (a peer may have populated it while we waited); short-circuit on cache hit.
    • Stage scheme-specifically into a per-call tmp under expected.size() (with a 1 GiB absolute ceiling on the declared value): http(s) streams via tokio_util::io::StreamReader + tokio::io::copy; file:// and hf:// go through a sized tokio::fs::copy.
    • Verify by rebuilding CheckedFile::from_disk(tmp) and comparing to the expected checksum — same hashing path the worker used at registration, so the comparison is bit-identical.
    • Atomic-rename tmp → blob via the shared stage_and_rename primitive.
    • symlink_force(blob, by-slug/<slug>/<mdcsum>/<filename>) — same atomic-rename primitive, so concurrent frontends don't collide on the symlink either.
  5. update_dir(by-slug/<slug>/<mdcsum>) points the typed CheckedFiles at the per-(slug, mdcsum) symlink directory. The <mdcsum> segment mirrors HF Hub's snapshots/<rev>/ and isolates worker sets that share a model name but publish different file content. Downstream consumers (tokenizer(), prompt_formatter(), ...) load from local paths.
  6. Model lands on /v1/models. Tokenizer is in memory; bytes don't need disk again.

Key design points

One additive field CheckedFile.size: Option<u64> with #[serde(default, skip_serializing_if = "Option::is_none")]. No Vec<ArtifactRef>, no ArtifactRole enum, no parallel checksum store. The typed-enum slots are the list; CheckedFile.checksum is the identity. Drift between sources of truth is structurally impossible.
size doubles as discriminant is_new_format() returns true iff every populated slot carries size = Some. Pre-PR MDCs deserialize cleanly with size = None and fall through to legacy hub::from_hf. Mixed states (some slots with size, some without) count as legacy — surfacing a likely worker bug rather than papering over it.
Cache layout HF-Hub-style content-addressed: ~/.cache/dynamo/mdc/blobs/<blake3-hex> + ~/.cache/dynamo/mdc/by-slug/<slug>/<mdcsum>/<filename> symlinks. Per-(slug, mdcsum) isolation handles worker-set churn. Cleanup is user-managed (same posture as HF Hub).
Concurrency flock(LOCK_EX) on <blob>.lock covers both intra- and inter-process. Atomic publish via stage_and_rename (per-call tmp + rename(2)) for both blob writes and symlink creation.
Verification Every fetched artifact rebuilt via CheckedFile::from_disk(tmp) and compared to expected.checksum(). Hard-fail on mismatch (no silent fallback to HF once the new path is engaged).
HTTP body bound Per-file cap from expected.size(), plus a 1 GiB absolute ceiling at resolve_uri entry to clamp the trust boundary. Bodies are streamed (no Vec<u8> buffering); pre-checked via Content-Length when present.
Worker URL discovery system_host from existing DYN_SYSTEM_HOST; wildcard binds (0.0.0.0/::) resolved via dynamo_runtime::utils::ip_resolver::get_local_ip_for_advertise() — the shared resolver used by the TCP request plane (IPv4 / IPv6 / URL bracketing handled uniformly). No bespoke implementation. No new env var.
Frontend system_status_server Operator does not set DYN_SYSTEM_PORT on frontend pods, so the route never spawns there.
Default-on with graceful skip Self-host is on by default (DYN_SELF_HOST_METADATA defaults to truthy; falsy values 0 / false / no / off opt out). Workers without a system_status_server (no DYN_SYSTEM_PORT) silently skip the http rewrite and keep their natural URI scheme — preserves out-of-the-box behavior for the 90+ sample DGDs that don't set DYN_SYSTEM_PORT.

Tests

  • Unit (lib/llm/src/common/checked_file.rs#tests): CheckedFile round-trip — legacy bytes deserialize with size = None; new bytes carry size = Some(N).
  • Unit (lib/llm/src/model_card.rs#tests): is_new_format / iter_metadata_files correctness; resolve_uri blake3 mismatch (mockito); parse_hf_uri cases; HTTP body oversize rejection (Content-Length pre-check + post-write overage).
  • Unit (lib/llm/src/local_model.rs#tests): DYN_SELF_HOST_METADATA parsing — default-on when unset, truthy values keep the default ON, falsy values (0 / false / no / off / empty) opt out.
  • Unit (lib/runtime/src/metadata_registry.rs#tests): registry contract — register/get/unregister/multi-model/clone-shares-state.
  • Integration (lib/llm/tests/model_card.rs): download_config resolves through cf.url() for file://, hf://, http://; blake3-mismatch rejection; legacy fallback when is_new_format() == false; per-test IsolatedHome RAII guard so the cache leaves no residue. Plus a #[cfg(feature = "integration")] worker+frontend round-trip via real DistributedRuntime + ModelWatcher + worker attach(), using a unique etcd namespace (<pid>-<uuid>) + random port so parallel cargo test invocations don't collide.
  • Local e2e (tests/frontend/test_mocker_self_host_multi_replica.py): CPU-only mocker workers parametrized over the URI scheme — self_host (http://), hf (hf://), and file (file://, via a local snapshot dir). Two replicas exercise concurrent registrations against an isolated HOME=<tmp_path>. A separate test pairs two DynamoFrontendProcess instances sharing one $HOME to exercise cross-process flock on the metadata cache. Marked gpu_0/e2e/post_merge.

Files changed

  • lib/llm/src/common/checked_file.rs — new size: Option<u64> field + accessor; from_disk populates it; serde additive.
  • lib/llm/src/model_card.rsis_new_format, iter_metadata_files / iter_metadata_files_mut, download_config reorder, resolve_metadata_files, resolve_uri, BlobLock, stage_and_rename / stage_and_rename_sync, symlink_force, scheme-specific stage helpers (stream_to_tmp, copy_to_tmp).
  • lib/llm/src/local_model.rsself_host_metadata field + setter, move_to_self_host (gracefully skips when system_status_server isn't running), self_host_base_url returning Option<String> (using shared get_local_ip_for_advertise), DYN_SELF_HOST_METADATA env-var-driven default (defaults to ON; falsy values opt out).
  • lib/llm/Cargo.toml — adds libc dependency for flock.
  • lib/runtime/src/metadata_registry.rs (new) — process-local (slug, filename) → PathBuf registry.
  • lib/runtime/src/distributed.rs — registry field + accessor on DistributedRuntime.
  • lib/runtime/src/system_status_server.rs — mounts /v1/metadata/{slug}/{*filename} route + handler.
  • lib/bindings/python/rust/lib.rsself_host_metadata kwarg on register_model.
  • lib/llm/tests/model_card.rs — integration tests + DRT-level integration test.
  • tests/frontend/test_mocker_self_host_multi_replica.py (new) — CPU-only multi-replica + multi-frontend e2e.

Out of scope

  • Operator companion change to inject DYN_SYSTEM_HOST=status.podIP via downward API. Not blocking — get_local_ip_for_advertise() covers k8s pods out of the box (operator only sets DYN_SYSTEM_PORT, leaving HOST at the 0.0.0.0 default → wildcard branch → interface enum → pod IP). Same posture as request plane / event plane; bind/advertise separation is a runtime-wide follow-up, not a self-host-specific blocker.
  • mx_client peer-direct unification — the URI scheme on CheckedFile is forward-compatible; mx:// already parses today.
  • Scale-to-zero metadata serving.
  • /v1/mdc observability endpoint.

nnshah1 and others added 9 commits April 27, 2026 00:28
Closed five-role enum covering the metadata files the frontend
consumes today (config, tokenizer, tokenizer_config, chat_template,
generation_config) plus a struct describing one entry in the upcoming
files vector on ModelDeploymentCard: filename, opaque uri, blake3
checksum, byte size, and role.

Types only; no call sites yet. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
gh-8749 §2.2)

Adds `files: Vec<ArtifactRef>` to ModelDeploymentCard. The field is
declared #[serde(default, skip_serializing_if = "Vec::is_empty")] so
old MDC bytes deserialize cleanly with an empty vector and new
workers don't bloat the wire format when populating it alongside the
legacy typed fields.

`from_repo_checkout` now populates `files` by walking the same five
typed artifacts (model_info, tokenizer, prompt_formatter,
chat_template_file, gen_config) and synthesizing one ArtifactRef per
reachable CheckedFile. Filenames come from the file's path, URIs are
`file://<absolute>` for local paths or the URL string for URL-backed
files, checksums are reused from CheckedFile's existing blake3, and
sizes are read from filesystem metadata when available.

The legacy typed fields are unchanged: new frontends prefer `files`,
old frontends ignore it. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749 §2.3)

Field defaults to false. Setter mirrors the existing http_* /
tls_* setter pattern on the builder. No behavior wired yet — the
field is read by registration in a follow-up commit. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
gh-8749 §2.4)

self_host_base_url(drt) builds http://<host>:<port> from the actual
bound port (system_status_server_info()) and the existing
DYN_SYSTEM_HOST config. When the configured host is the unspecified
bind sentinel (0.0.0.0/::), substitutes a routable IP detected via
the standard UDP-connect kernel-routing trick — bind a UDP socket,
connect() to a public address, read local_addr(). connect() on UDP
does not transmit, so this works air-gapped. Falls back to
127.0.0.1 when no usable interface is found, yielding a loud
connection-refused on the consumer rather than silently broken
content. Errors immediately when the system_status_server isn't
running so registration doesn't publish an unreachable URL.

self_host_base_url is gated #[allow(dead_code)] for one commit; it
becomes reachable in the §2.6 commit that wires the opt-in path.
detect_routable_ip is exercised by a unit test asserting the result
parses as an IP address.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749 §2.5)

Adds a new lib/runtime/src/metadata_registry.rs module with a
process-local MetadataArtifactRegistry mapping (model_slug,
filename) to on-disk paths, modeled on the existing
EngineRouteRegistry. Stored as a field on DistributedRuntime with
a metadata_artifacts() accessor.

Mounts a new GET /v1/metadata/{model_slug}/{*filename} route on
the system_status_server alongside the /v1/loras precedent. The
route is mounted unconditionally so an empty registry is harmless;
handler returns 404 when (model_slug, filename) is not registered,
500 on a local read error, and the raw file bytes on success.
Consumers blake3-verify against the MDC entry, so the transport is
untrusted by construction.

Six unit tests cover registry roundtrip, miss paths, multi-model
isolation, model unregister, and clone semantics. The opt-in path
that populates the registry lands in §2.6. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749 §2.6)

Threads self_host_metadata from LocalModelBuilder into LocalModel
and consumes it in attach(). When the flag is true, the new
LocalModel::wire_self_hosted_artifacts helper:

  1. Builds the worker's base URL via self_host_base_url (errors
     loudly if DYN_SYSTEM_PORT is unset).
  2. Walks the MDC's `files` vector. For each ArtifactRef whose URI
     is a file:// URL pointing at this worker's local disk:
       - inserts (model_slug, filename) → on-disk path into
         DistributedRuntime::metadata_artifacts(), so the
         /v1/metadata/{slug}/{filename} handler can serve the bytes;
       - rewrites the ArtifactRef.uri in place to that route URL.
     Entries with non-file URIs (hf://, mx://, ...) are left as-is.

The legacy hf:// rewrite of the typed enum fields runs as before, so
old frontends continue to fall through to that path. Removes the
one-commit #[allow(dead_code)] on self_host_base_url. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…-8749 §2.7)

Threads the optional self_host_metadata: Option<bool> through the
PyO3 register_model signature into LocalModelBuilder. Default
remains false: existing worker recipes (vllm/sglang/trtllm)
continue to behave identically; recipe authors flip the kwarg when
they want self-hosting. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…gh-8749 §2.8)

Adds try_self_host_download() on ModelDeploymentCard. download_config
now tries this path before the legacy hub::from_hf fetch:

  - Walks the MDC's `files` vector for entries with http(s):// URIs.
  - GETs each, verifies the body's blake3 against ArtifactRef.checksum
    (loud failure on mismatch — no silent fallback to HF).
  - Atomically writes verified bytes to ~/.cache/dynamo/self-host-metadata/{slug}/
    (tmp-then-rename, so a partial download never leaves a corrupt
    file in place).
  - Calls update_dir(cache_dir) so downstream consumers
    (`tokenizer()`, `prompt_formatter()`, ...) see local paths
    pointing at the verified files.

When the artifact list is empty or carries no http(s) URIs, returns
Ok(false) and download_config falls through to the existing
hub::from_hf path. Old MDC bytes (no `files` field) decode with an
empty vector via #[serde(default)], so this preserves cross-version
compatibility for unmodified workers.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…§2.9)

Four new unit tests:

  - artifact_ref_serde_roundtrip — ArtifactRef survives serde JSON
    with all five fields preserved.
  - artifact_role_serde_snake_case — locks the wire format for the
    closed five-role enum (config / tokenizer / tokenizer_config /
    chat_template / generation_config).
  - mdc_without_files_deserializes_to_empty_vec — cross-version
    compatibility guarantee: MDC bytes without a `files` field
    deserialize cleanly with `files` defaulted to an empty vector,
    so old workers' MDCs interoperate with new frontends.
  - fetch_and_verify_rejects_checksum_mismatch — uses mockito to
    serve a payload whose blake3 doesn't match the expected checksum;
    asserts the helper errors with a clear "checksum mismatch"
    message and leaves no file at the destination (the
    tmp-then-rename atomicity guarantee).

Integration tests against a live deployment (worker opt-in +
mountless frontend, end-to-end /v1/models verification) are deferred
to a follow-up — they need the k8s test harness, not just the unit
suite. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
Removes three methods with no production callers and the two tests
that depended on them:

  - unregister_model: no detach path calls it; speculative addition.
    Worker process exit clears the whole map already. When we want
    per-model cleanup, ~10 LOC re-adds it.
  - len, is_empty: only used by tests. Tests now assert correctness
    via get() — the actual contract — instead of the count.

Also drops the `clone_shares_state` test, which only exercised the
Arc-share semantic that #[derive(Clone)] documents at the type level.

Module shrinks from ~125 to ~85 LOC; production API is now
new / register / get. All four remaining tests pass. Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…pty (gh-8749)

Effectively reverts 0d4ab6d.

Restoring the methods on second look:
  - register / unregister: parallel-construction symmetry. Even though
    no production caller invokes unregister yet, the pair reads as a
    natural domain API. Detach paths will need it; having it ready
    keeps that change a one-liner.
  - len / is_empty: standard collection accessors. Lightweight
    (one read-lock each) and useful for observability/diagnostics
    beyond the test suite.

The clone_shares_state test also comes back — small but explicit
coverage of the Arc-share semantic, which is load-bearing for the
DistributedRuntime → SystemStatusServer handoff.

Module returns to its pre-simplification shape (~125 LOC, 6 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…es (gh-8749)

Two renames in service of clearer naming and parallel construction:

  Worker side (local_model.rs):
    wire_self_hosted_artifacts -> move_to_self_host

  Mirrors the existing ModelDeploymentCard::move_to_url pair (same
  shape: walks every typed-enum CheckedFile, rewrites URIs in place;
  opposite of update_dir). "wire" was an outlier — no other lib/
  function uses it.

  Frontend side (model_card.rs):
    try_self_host_download -> try_download_files
    self_host_cache_dir    -> files_cache_dir
    .cache/dynamo/self-host-metadata/  -> .cache/dynamo/model-metadata/

  The frontend consumer is scheme-agnostic: it walks `files`, picks
  http(s):// entries, fetches them, and verifies blake3. It doesn't
  know or care whether the URIs point at the originating worker, a
  peer holder, modelexpress, or some other front-cache. Naming the
  function "self_host" baked in a worker-side concern that has no
  business in the consumer.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749)

Match the legacy `download_config` posture: it doesn't own a disk
cache, it delegates to `hub::from_hf` which manages HF Hub's cache.
The new branch was creating its own persistent directory under
~/.cache/dynamo/model-metadata/, which is asymmetric and accumulates
state in $HOME with no cleanup.

Switch to a per-process scratch directory under `std::env::temp_dir()`
keyed by PID + slug:
  /tmp/dynamo-mdc-<pid>/<slug>/<filename>

Process restart re-fetches (~10 MiB total per model, sub-second on a
DC network — same cost the legacy path pays when HF Hub's cache
misses). The OS reaps tempdir on reboot. No long-running disk
artifact in $HOME, no cleanup logic.

Renames `files_cache_dir` -> `files_download_dir` to reflect that
this is transient scratch, not a cache.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
)

Replaces two ad-hoc helpers with a scheme-uniform resolver and an
HF-Hub-style content-addressed cache.

resolve_uri(client, uri, expected_checksum, dest) dispatches by URI
scheme, so the consumer side is genuinely transport-agnostic:
  - http/https — reqwest GET, body verified
  - file       — tokio::fs::read of the local path, verified
  - hf://repo[@rev]/filename — hub::from_hf, copy from snapshot,
                                verified
All schemes write atomically (tmp + rename); checksum mismatch is a
hard error in every branch.

Cache layout mirrors HF Hub:
  ~/.cache/dynamo/mdc/blobs/<blake3-hex>           — bytes
  ~/.cache/dynamo/mdc/by-slug/<slug>/<filename>    — symlink → blob
Multi-worker dedup is automatic: same blake3 → same blob path → the
existence check skips the fetch. Frontend restart hits the warm
cache. Cleanup is user-managed, same posture as HF Hub.

try_download_files becomes download_files, walks every entry in the
files vector through resolve_uri, and update_dirs the typed
CheckedFiles at the per-slug symlink directory. Returns Ok(false)
only when files is empty so the caller falls through to the legacy
hub::from_hf path; non-empty files but a single resolution failure
is loud.

move_to_url now also walks the files vector, rewriting `file://`
entries to the same `hf://repo/filename` URLs it produces for the
typed enums. Self-host (`http://`) entries are preserved — the
file://-prefix gate keeps move_to_url from clobbering them when it
runs after move_to_self_host. Shared-volume deployments
(source_path.exists() on the worker) skip move_to_url entirely, so
their `file://` URIs are preserved across both surfaces.

Updates the in-source mockito test to call resolve_uri instead of
the removed fetch_and_verify, plus a parse_hf_uri unit test.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749)

Five new tests in lib/llm/tests/model_card.rs alongside the existing
TinyLlama-fixture tests:

  - test_files_vector_populated_from_local_repo — load TinyLlama,
    assert mdc.files has entries with file:// URIs, blake3 checksums,
    sizes, and roles drawn from the closed five-role enum.
  - test_download_files_resolves_local_file_scheme — round-trip
    file:// resolution into the content-addressed cache, verify blob +
    by-slug symlink layout, assert the tokenizer loads via the
    resolved path.
  - test_download_files_dedupes_by_blake3 — call download_config
    twice, assert blob mtimes are unchanged on the second call (the
    blake3-keyed exists() check skips the fetch).
  - test_download_files_rejects_blake3_mismatch — tamper with one
    ArtifactRef.checksum, assert download_config errors with a clear
    "checksum mismatch" message and leaves no blob behind.
  - test_download_files_resolves_hf_scheme_with_token — HF_TOKEN
    runtime-skip pattern (matches preprocessor.rs); same Llama-3.1
    fixture the existing preprocessor tests use; exercises the
    resolve_uri hf:// branch end-to-end.

Tests are serialized via serial_test::serial because they share the
$HOME-rooted MDC cache. Each test installs its own tempdir-backed
$HOME so it doesn't pollute the user's real ~/.cache/dynamo/.

Bug fix: artifact_ref_from_checked_file canonicalizes the local path
before constructing the file:// URI. url::Url::from_file_path
requires an absolute path; relative paths returned None and the
files vector came out empty for any caller that loaded the MDC from
a relative path (production goes through LocalModelBuilder::build
which already canonicalizes; the bug only manifested in tests and
direct load_from_disk callers). Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…lt (gh-8749)

Lets deployments turn self-hosting on without modifying every worker
recipe. When DYN_SELF_HOST_METADATA is set to a truthy value (1,
true, yes, on; case-insensitive), `LocalModelBuilder::default()`
initializes `self_host_metadata` to true. Explicit setter calls
(`.self_host_metadata(...)`) and explicit PyO3 kwargs always win
over the env-var default.

Also tightens the PyO3 binding: the builder method is now only
called when the kwarg is `Some(_)`. An unset kwarg falls through to
the builder's default, so the env var is honored end-to-end.

Two unit tests cover the truthy/falsy parsing, gated by
serial_test::serial because they share a process-global env var.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
gh-8749)

Tightens doc comments and inline comments across the new self-host
code to match the surrounding files' style: brief one-line docs on
public APIs, no doc on simple private helpers, longer prose only
where the WHY is non-obvious. Removes design rationale, motivation,
and "mirror existing pattern" prose — the type signatures and
existing tests carry that load.

Also adds tests/frontend/test_vllm_self_host_metadata.py — a local
e2e clone of test_vllm.py with the only diff being
DYN_SELF_HOST_METADATA=true in the worker process env. Verifies the
full chain by asserting the model lands on /v1/models. Uses QWEN
(Qwen3-0.6B). Marked vllm/gpu_1/e2e/post_merge so it slots into the
existing vLLM e2e CI gate.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
@nnshah1 nnshah1 changed the title DEP draft (light): Worker self-hosted metadata files Worker self-hosted metadata files Apr 27, 2026
…h-8749)

Strengthens the Python e2e to actually validate that the frontend
fetched via the self-host http path: the frontend process now runs
under an isolated HOME=<tmp_path> so any cache writes are scoped
to the test, and the test asserts at least one blake3-keyed blob
appears under <HOME>/.cache/dynamo/mdc/blobs/. That directory is
only ever written by ModelDeploymentCard::download_files, so its
contents prove the http resolver ran end-to-end (the legacy
hub::from_hf fallback writes to the HF Hub cache, not to this
directory).

Adds a DRT-level e2e in lib/llm/tests/model_card.rs under
#[cfg(feature = "integration")] mod integration_tests, modeled on
lib/llm/tests/http_metrics.rs. Real DistributedRuntime with
DYN_SYSTEM_PORT=0, real worker LocalModel::attach -> ModelWatcher
-> download_config -> cache populated. Asserts blob count matches
the artifact list and the tokenizer loads. Gated #[ignore] because
it requires etcd; run with
cargo test -p dynamo-llm --test model_card --features integration -- --ignored.

Parallelism:
  - Within a binary: #[serial_test::serial] because HOME and
    DYN_SYSTEM_PORT are process-global.
  - Across cargo test processes: each gets its own tempdir-rooted
    HOME, its own random port (DYN_SYSTEM_PORT=0), and a unique
    etcd namespace (self-host-it-<pid>-<uuid>); the watcher uses
    NamespaceFilter::Exact so two parallel runs don't see each
    other's MDCs.

Replaces the bare tempfile::TempDir isolated_home() helper with a
true RAII guard (IsolatedHome) that restores the previous HOME on
drop, so subsequent tests aren't left pointing at a deleted
tempdir path.

Refs #8749.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
@nnshah1

nnshah1 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor Author

The size field was added for two reasons: (1) per-file proactive cap on transfers, (2) is_new_format() discriminant so old workers stay on the legacy path. We can drop it and use ABSOLUTE_MAX_METADATA_BYTES (1 GiB) as the cap — slight behavior change for old HF-source workers that would gain blake3 verification on the new frontend, but the previous code didn't explicitly check hashes on metadata file contents, so we're trading silent corruption on HF drift for loud failure, which is overall good. Tangentially: the same HF version-drift problem makes me lean toward removing the HF fetch path entirely going forward — metadata should always be self-hosted, and we can deprecate the hf:// and shared-mount fallbacks rather than continue maintaining them.

Apply the same lessons from grahamking's review across the rest of
the PR's new code:

- ENV_SELF_HOST_METADATA constant doc trimmed from 5 lines of
  parsing rules to a one-liner, matching the existing convention
  in lib/runtime/src/config/environment_names.rs
  (e.g. DYN_SYSTEM_HOST/PORT/etc).
- LocalModelBuilder::self_host_metadata setter doc rewritten to
  match the runtime config field convention from
  lib/runtime/src/config.rs ("Set this at runtime with environment
  variable DYN_FOO").
- model_card.rs: demote the "resolved model metadata files" log
  from info! to debug! — once-per-registration is too quiet for
  info, same pattern we just demoted in move_to_self_host.
- model_card.rs::stream_to_tmp: cap.saturating_add(1) -> cap + 1
  (cap bounded by ABSOLUTE_MAX_METADATA_BYTES = 1 GiB, no overflow).
- model_card.rs::stream_to_tmp: surface flush errors via
  with_context()? instead of silent .ok(); a flush failure right
  before atomic-rename would mean the staged bytes weren't fully
  on disk.
- local_model.rs::move_to_self_host: use fs::canonicalize (the
  module is already imported via `use std::fs;` at top of file).

Signed-off-by: nnshah1 <neelays@nvidia.com>
…ymlink (gh-8749)

`move_to_self_host` was calling `fs::canonicalize()` on the local
path and then taking `file_name()` from the canonical result. For
workers backed by the HuggingFace cache, snapshot/<rev>/<filename>
is itself a symlink into blobs/<git-sha1>; canonicalize follows it
and returns the SHA1 as the file_name, so:

  * the http URL becomes /v1/metadata/<slug>/<sha1>
  * the per-(slug, mdcsum) symlink in the frontend's cache is
    placed at <slug>/<mdcsum>/<sha1>
  * downstream code that opens files by literal name (e.g.
    `HFConfig::from_json_file` doing
    `parent.join("generation_config.json")` to fall back to
    generation_config.json for `eos_token_id`) fails with
    "Failed to open file" — the file *is* in the cache, just
    under a different name.

This bit the vllm Multi-GPU `mm_e_pd_llava-1.5-7b` test in CI
because llava-1.5-7b's `eos_token_id` lives only in
generation_config.json (Qwen3-0.6B has it directly in
config.json, which is why our mocker tests didn't catch it).

Fix: derive `filename` from the original (pre-canonicalize)
path so the URL, the per-slug symlink, and the registry key all
use the human-readable name. Keep canonicalize for the
*registry value* — the http handler still needs an absolute
path to the actual blob to serve bytes.

Local repro added (and passes after fix):

  pytest -xvs tests/frontend/test_llava_repro.py
  ========================= 1 passed in 13.63s =====================

Existing mocker tests still pass:

  pytest -xvs tests/frontend/test_mocker_self_host_multi_replica.py
  ========================= 4 passed in 69.12s =====================

Signed-off-by: nnshah1 <neelays@nvidia.com>
)

`ModelDeploymentCard::update_dir` deliberately skips the
chat_template_file slot when `is_custom == true` — the legacy
`hub::from_hf` fall-through doesn't fetch user-supplied custom
templates (they aren't on HF), so leaving the path untouched kept
the original local one for the legacy code path.

In the new-format path we *do* fetch every populated slot
(including custom templates served via the worker's self-host
route), so the URL-skip exception no longer applies. The previous
exception left `chat_template_file.path()` as a URL post-resolve,
and `PromptFormatter::from_mdc` would bail with:

  HfChatTemplateJinja for <model> is a URL, cannot load

Caught by the trtllm Multi-GPU `raw_embeddings_epd-2` test
(which uses `--custom-jinja-template` via `agg_raw_embeddings_llava.sh`).

Fix: in `resolve_metadata_files`, after fetching every slot via
`iter_metadata_files`, walk `iter_metadata_files_mut()` and call
`CheckedFile::update_dir(&slug_dir)` on each — same enumeration
as the fetch loop, so we cover exactly the files we just fetched.
This bypasses the legacy `is_custom` exclusion that's still load-
bearing for the legacy `hub::from_hf` fall-through.

Verified locally: existing mocker suite (3 schemes + multi-frontend)
+ llava repro all pass: 5 passed in 82.63s.

Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749)

Comment-only follow-up to 262f314 + 1257fe0. Trim two
WHY blocks (7-line and 6-line) down to 3-4 lines each while
keeping the load-bearing reason for each non-obvious choice
(canonicalize-vs-original-filename, and skip-self.update_dir
because of legacy is_custom exclusion).

Signed-off-by: nnshah1 <neelays@nvidia.com>
Comment thread lib/llm/src/model_card.rs
}

impl BlobLock {
async fn acquire(blob_path: &Path) -> anyhow::Result<Self> {

@grahamking grahamking Apr 29, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this function please?
If the idea is to lock the file because several concurrent processes are writing it, that should not be the case. Each machine only has one frontend, and the cache should be local to the machine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most production settings that's likely the case — but not guaranteed, and it also breaks our current parallel testing strategy. We use flock because it prevents both intra-process and inter-process races (workers across different worker sets can register at the same time and land on the same blob if they share file content).

Two cases the operator's one-frontend-per-pod model doesn't cover:

  1. Local dev / CIpytest runs two frontend processes against one mocker (see tests/frontend/test_mocker_self_host_multi_replica.py::test_two_frontends_share_metadata_cache_via_flock); both share $HOME and race on the same ~/.cache/dynamo/mdc/blobs/<blake3>.
  2. Cross-worker-set in one frontend — two MDCs (different model_name or namespace) carrying a shared file via blake3 (e.g., two fine-tunes of one base model with identical tokenizer.json). Their registrations run concurrently in the discovery watcher and would write the same blob without serialization.

This is the same pattern huggingface_hub uses internally — fcntl-based locks under ~/.cache/huggingface/hub/.locks/ around _download_to_tmp_and_move to coordinate concurrent downloads of the same blob. flock(LOCK_EX) covers both intra- and inter-process races via per-open-file-description semantics, so it's one mechanism.

Comment thread lib/llm/src/model_card.rs Outdated
fn blake3_hex_of(checksum: &str) -> anyhow::Result<&str> {
checksum
.strip_prefix("blake3:")
.with_context(|| format!("expected blake3 checksum, got: {checksum}"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a Checksum type in llm/src/common/checked_file.rs, might be able to use that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. 63728476cc adds pub fn hash(&self) -> &str (and algorithm()) on Checksum, drops the blake3_hex_of helper, and reads the hex directly.

Comment thread lib/llm/src/model_card.rs Outdated

let mut fetched = 0usize;
for (uri, expected, filename) in &entries {
let blake3_hex = blake3_hex_of(&expected.checksum().to_string())?.to_string();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here checksum() could return a Checksum which knows it's hash.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same patch (63728476cc). expected.checksum().hash() replaces the to_string() + strip_prefix round-trip.

Comment thread lib/runtime/src/metadata_registry.rs Outdated

use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use parking_lot::RwLock instead. Most of the code base migrated to that a while back. It avoids the unwrap(), and has several other benefits.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 63728476cc — swapped std::sync::RwLock for parking_lot::RwLock; the .unwrap() calls go away.

@grahamking

Copy link
Copy Markdown
Contributor

I ran it through graham-code-review and it said this 😂

Nits

  • Some of the new comments are longer than the code they describe. The cache/lock comments are useful; the obvious “this function does X” ones can go.

…h-8749)

- `Checksum::hash()` / `algorithm()` accessors so callers can read the
  blake3 hex without round-tripping through `Display` + `strip_prefix`.
- Drop `blake3_hex_of` helper in `model_card.rs`; callers use
  `cf.checksum().hash()` directly.
- Swap `std::sync::RwLock` for `parking_lot::RwLock` in
  `metadata_registry`; drops `.unwrap()` calls and matches the rest of
  the codebase.

Per Graham's review on PR #8754.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749)

Drop comments that just restate function names / short signatures:
- `symlink_force` doc reduced to the concurrency note.
- `filename_from_checked_file`, `pf_checked_file`, `copy_to_tmp` —
  function names + signatures already say what they do.
- `move_to_self_host` 6-line preamble collapsed to 3 lines that capture
  the URL-vs-local dispatch (the non-obvious part).
- `clear_self_hosted_artifacts` 8-line block collapsed to 2.
- `self_host_base_url` host-selection paragraph removed; the IPv4/v6
  resolver behavior lives in `ip_resolver`, not here.

Cache/lock/concurrency rationale (BlobLock, stage_and_rename,
unique_tmp_path, resolve_uri scheme list) kept verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the Stale label May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation feat size/XXL Stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants