feat(self-host): frontend MDC verify-and-cache pipeline (gh-8749)#9057
Conversation
WalkthroughA workspace dependency on ChangesArtifact Caching & Metadata Resolution
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/model_card.rs`:
- Around line 205-216: The call to AsRawFd and libc::flock inside async fn
acquire (the block that creates/opens lock_path and calls unsafe {
libc::flock(f.as_raw_fd(), libc::LOCK_EX) }) is Unix-only and must be guarded;
wrap the flock-specific code in a #[cfg(unix)] so that the current
spawn_blocking closure uses AsRawFd and libc::flock only on Unix, and provide a
#[cfg(not(unix))] fallback that still opens/creates the same lock_path (e.g.,
open the file with OpenOptions) but implements a Windows-compatible lock (call
LockFileEx via winapi) or, if you intentionally accept no-op locking on
non-Unix, document that and open the file without calling flock; ensure symbols
mentioned (acquire, lock_path, AsRawFd, libc::flock,
tokio::task::spawn_blocking) are adjusted accordingly.
- Around line 937-951: The metadata HTTP client is created with
reqwest::Client::new() which has no timeouts; replace that with a configured
client via reqwest::Client::builder() and set explicit connect and overall
timeouts (e.g., std::time::Duration values) and, if you need a separate read
timeout, ensure resolve_uri uses request.timeout(...) on the RequestBuilder or
set per-request timeouts there; then pass the built client to resolve_uri
instead of Client::new(). Target symbols: replace the reqwest::Client::new()
call in download_config()/the loop, configure via
Client::builder().connect_timeout(...) and .timeout(...).build(), and ensure
resolve_uri(&client, ...) continues to receive the client with these timeouts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f90afcbd-440b-4d0d-a842-6b4c099ebb6c
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
lib/llm/Cargo.tomllib/llm/src/common/checked_file.rslib/llm/src/model_card.rs
Two CodeRabbit findings on PR #9057: 1. `BlobLock::acquire` used `AsRawFd` + `libc::flock` unconditionally, which won't compile on non-Unix targets. Wrap the flock-specific block in `#[cfg(unix)]`. The non-Unix path still opens the lock file (so atomic-rename invariants hold for intra-process callers) but skips cross-process locking with a documented caveat — dynamo's frontend ships on Linux, this is build-time portability. 2. `reqwest::Client::new()` in async mode has no default timeout. A stalled metadata endpoint would hang `download_config()` indefinitely. Build the client with explicit `connect_timeout(10s)` + total `timeout(300s)`. 5 min covers a 1 GiB file at slow speeds; realistic metadata files complete in seconds. Refs #8749 #9057 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
…d skew (gh-8749) Diagnostic for the trtllm Multi-GPU `raw_embeddings_epd-2` failure on PR #9057. The frontend errors with "missing eos_token_id in config.json and generation_config.json, cannot load" because slug_dir lacks generation_config.json. Two unknowns to surface: 1. **Worker side:** when does `from_repo_checkout` settle on `gen_config = None`? `GenerationConfig::from_disk(local_path).ok()` silently swallows errors. If the file IS on disk but loading failed, that's a real bug — log it as a warning. If the file is genuinely absent, log at debug. Either way we get attribution. 2. **Frontend side:** when `resolve_metadata_files` runs, log which slots were populated on the MDC at info level. A None gen_config slot here means no symlink in slug_dir, which is exactly the condition that produces the eos_token_id error downstream. Both diagnostics ship at info/warn so they'll show up in CI logs without extra config. Will revert the info-level slot log to debug once the test failure is understood. Refs #8749 #9057 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
…h-8749) `resolve_metadata_files`'s file:// arm was the frontend mirror of the canonicalize→file_name anti-pattern that the capstone branch already fixed for the worker side in `local_model.rs::move_to_self_host` (commit 262f314). Same shape, different code path: - Worker (capstone): synthesizes `http://<worker>/v1/metadata/<slug>/<filename>`, used to take filename from canonicalized path → blob hash. - Frontend (this PR): synthesizes `file://<canonicalized>` + records filename for slug_dir symlink, was taking filename from canonicalized path → also blob hash. Locally reproduced against the actual llava-v1.6-mistral-7b-hf snapshot at the test's pinned revision 52320fb52229: `download_config` ran to completion but the slug_dir held entries named for the HF blob hash: d84663b8bd0a737e3299f3d2b4be2fe3cd2a0333 (was: generation_config.json) 8bc605421e7ff49a5f5ab76f2a26fe0de73cf9d3 (was: config.json) 6f43c1fabc73af13375a3b0c264656711228cd4a (was: tokenizer.json) 754adb1891a73bab3f53b94218e6665ec7fc583c (was: tokenizer_config.json) Pass 3's `cf.update_dir(&slug_dir)` then rewrote each cf.path to `<slug_dir>/<logical-filename>` — pointing at nonexistent entries. Downstream `HFConfig::from_json_file` doing `file_path.parent().join("generation_config.json")` to fall back to generation_config.json for `eos_token_id` failed with the test's "missing eos_token_id" error. Fix mirrors the capstone: derive the logical filename from cf.path() BEFORE canonicalize. Canonicalize is still required for `Url::from_file_path` to accept the path, but its result is used only for the URL, not for naming. Regression test: `resolve_metadata_handles_hf_cache_symlinks` builds an HF-cache-style snapshot (per-file symlinks into a sibling blobs/ layout) and asserts slug_dir has logical filenames, no blob-hex leakage. Verified locally against the real llava-v1.6 snapshot. Refs #8749 #9057 Refs capstone fix 262f314 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
43f3aa7 to
58e34ac
Compare
Pass 1 was rewriting cf.path to a `file://<canonicalized>` URL when the worker's path was a local file. For HF cache snapshots that path resolves through a symlink into `blobs/<sha>`, so the URL's basename became the blob hash. Pass 3's `update_dir` then derived the cf's new path from that basename, leaving cf.path at `<slug_dir>/<blob-hash>` instead of `<slug_dir>/<filename>` — the symlinks Pass 2 placed at `<slug_dir>/<filename>` were unreachable from cf.path, and downstream code reading siblings by name (e.g. `HFConfig` looking for `generation_config.json` next to `config.json`) hit "Failed to open file". Fix mirrors the worker-side fix in `move_to_self_host` from the capstone: don't mutate cf to a canonicalized URL — construct the fetch URI as a local string and leave cf.path as the worker published it. Pass 3's `update_dir` now operates on the original PathBuf and naturally writes back the logical filename. Same regression class CodeRabbit caught for the http:// arm, just on the file:// arm. The existing `download_config_pipelines_local_files _through_cache` test now uses an HF-cache-style symlink fixture so this path stays covered. Refs #8749 #9057 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
gh-8749) Pass 1 collapses to one map() call with two helpers. Replaces `std::fs::canonicalize` with `std::path::absolute` — same effect for `Url::from_file_path` (needs absolute) without an extra syscall or symlink-follow. Drops verbose comments that were stating the obvious. Refs #8749 #9057 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
…8749) Worker `move_to_self_host` swaps `fs::canonicalize` for `std::path::absolute`. Filename derivation is unchanged (still pre- canonicalize per capstone 262f314). Now both worker and frontend URL/registry construction follow the same pattern: pre-canonicalize filename + non-symlink-resolving absolute path. The http handler's per-request symlink follow is the only behavioral delta; HF cache blobs are append-only so the resolved path doesn't go stale. Refs #8749 #9057 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
- Drop is_custom_chat_template helper; use existing
PromptFormatterArtifact::is_custom() directly.
- Fold iter_metadata_entries into iter_metadata_files (single source
of truth, returns Vec<(&CheckedFile, bool)>); 3 callers destructure
the bool they don't need.
- Replace expect("path or url") with let-else+bail.
- Trim verbose comments on ABSOLUTE_MAX_METADATA_BYTES and the cap
computation; clarify cache-hit comment is verify-on-write only
(no contradictory implication of re-verification).
- Pre-resolve hf:// repos once per unique repo before the per-file
fetch loop; avoids N hub::from_hf calls for one model registration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
stage_and_rename and stage_and_rename_sync only cleaned up the tmp on Err or rename-failure. An async cancellation (handle_put aborted by duplicate Added or delete-races-put) drops the future mid-await and leaves .tmp.<pid>.<uuid> orphans. Wrap the tmp path in a TmpGuard whose Drop unlinks it; dismiss the guard after successful rename so the no-op unlink (post-rename ENOENT) is skipped. Test: drop the stage_and_rename future via tokio::time::timeout to simulate cancellation, verify the tmp file is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Same-(model, namespace, type) registrations are already serialized by ModelWatcher's registering_worker_sets; only one task per MDC reaches download_config. Atomic rename + unique tmp + content addressing handles the cross-model and cross-process cases without a lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
No longer used after BlobLock removal. `disk.rs` uses `nix::libc::*` (re-exported from nix), so dropping the direct dep is safe. Caught by cargo-machete on the pre-merge check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Closes the bit-rot / shared-cache-poison / our-write-path-bug window. Cost: blake3 over the file (~5 ms typical, up to ~30 ms for a 30 MB tokenizer.json) per first-worker-per-MDC. Steady state and worker rolls don't trigger it — the watcher's registering_worker_sets dedup makes download_config a per- (model, namespace, type) operation, and subsequent workers join the existing WorkerSet without resolving again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
|
@nicolasnoble — both cross-cutting items addressed:
|
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
nicolasnoble
left a comment
There was a problem hiding this comment.
One small thing on 81b4358's commit message: the safety argument "watcher serializes via registering_worker_sets so the lock is redundant" is incomplete - download_config is also called from lib/bindings/c/src/lib.rs:1501 outside the watcher. The actual reason the removal is safe is the trio of content-addressed paths + unique tmp + verify-before-rename, where concurrent writers publish identical bytes. Worth a sentence in the commit body or a code comment near resolve_uri so future readers don't misread the guard's scope.
| /// reject anything that doesn't look like the algorithm's canonical | ||
| /// digest. Called from both `Deserialize` and `TryFrom<&str>` so both | ||
| /// wire entry points stay in lockstep. | ||
| fn validate_checksum_hash(algorithm: CryptographicHashMethods, hash: &str) -> Result<(), String> { |
There was a problem hiding this comment.
This doesn't check the hash is correct, it only checks that it looks like a hash.
I gather that's intended, but it's confusing. Could you check the shape when you check the value also?
/// Worker-controlled wire field used as a path component downstream;
What does that mean? Is it relevant here.
| model.namespace_prefix(), | ||
| ); | ||
| let local_model_path = | ||
| (!model.path().as_os_str().is_empty()).then(|| model.path().to_path_buf()); |
There was a problem hiding this comment.
In here and the other places, why do you do as_os_str() before checking is_empty()? Likely can simplify.
| let Some(filename) = local_path | ||
| .file_name() | ||
| .and_then(|f| f.to_str()) | ||
| .map(|s| s.to_string()) | ||
| else { | ||
| continue; | ||
| }; | ||
| // Canonicalize so the handler serves the resolved HF blob, not the symlink. | ||
| let absolute = match fs::canonicalize(&local_path) { | ||
| let absolute = match std::path::absolute(&local_path) { |
There was a problem hiding this comment.
From the docs:
unlike canonicalize, this does not resolve symlinks and may succeed even if the path does not exist
You deleted the comment about symlinks, does that mean we don't have symlinks here any more? This one seems like it might break things.
| /// isolates worker sets that share a model name but publish different | ||
| /// file content. | ||
| fn mdc_cache_root() -> PathBuf { | ||
| let home = std::env::var("HOME") |
There was a problem hiding this comment.
Nit: Normal practice is to import the model, so this should be just env::var(...
| "tmp.{}.{}", | ||
| std::process::id(), | ||
| uuid::Uuid::new_v4().simple() | ||
| ); |
There was a problem hiding this comment.
Could you use the tempfile crate instead? We already have that.
| /// ignores ENOENT — safe under double-cleanup races. | ||
| struct TmpGuard { | ||
| path: Option<PathBuf>, | ||
| } |
| .with_context(|| format!("renaming {} -> {}", tmp.display(), dest.display()))?; | ||
| guard.dismiss(); | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Why do you have an async and a sync version? Are they both used?
| Ok(()) | ||
| } | ||
|
|
||
| async fn copy_to_tmp(src: &Path, tmp: &Path, cap: u64) -> anyhow::Result<()> { |
There was a problem hiding this comment.
What do you use this for? I thought you streamed to tmp and then rename?
|
Could you do another pass and push your agent to simplify? This feels like you accept it's first draft, both in the tests and in the complex temp/rename/copy setup. |
What
Frontend slice of #8749. Every metadata
CheckedFile(config.json, tokenizer.json, …) flows through one resolve pipeline: derive a URI, fetch via the URI's scheme handler, blake3-verify, atomically publish to a content-addressed cache at~/.cache/dynamo/mdc/. Builds on PR1 (#8855) which shipped the worker side.Resolution chain
checked_file_uriperCheckedFile. Path-only slots are coerced to a syntheticfile://URL up front, then:http://,https://,hf://→ use as-is.file://and the file resolves locally → use as-is.file://missing, but<--model-path>/<basename>exists → rewrite to that. Lets a frontend host a local copy when the worker's path is unreachable andsource_pathisn't a valid HF repo id.hf://<source_path>/<basename>(legacy).Worker-published checksums always preserved.
size: Option<u64>is an optional cap-tightener (1 GiB absolute ceiling fallback).Cache layout
flock(LOCK_EX)per blob serializes intra/inter-process writers; multiple frontends sharing\$HOMEcollapse to one fetch per blake3.Not in this PR
DYN_SELF_HOST_METADATA→ PR4 (capstone) along with multi-replica e2e.Test plan
cargo build -p dynamo-llmcleancargo test -p dynamo-llm --lib -- model_card common::checked_filegreencargo clippy -p dynamo-llm --lib -- -D warningscleancargo fmt --checkcleanDYN_SELF_HOST_METADATA=trueworker → frontend → http resolve → cache populated--model-path /local/copyoverrides unreachable worker paths → cache populated