feat(self-host): auto-clean metadata-registry on detach (gh-8749) by nnshah1 · Pull Request #10351 · ai-dynamo/dynamo

nnshah1 · 2026-06-05T06:43:12Z

Overview

MetadataArtifactRegistry was a write-only sink: every move_to_self_host added entries via register, but nothing ever removed them on detach. The only cleanup method on the worker side (LocalModel::clear_self_hosted_artifacts) required a LocalModel handle that the static LocalModel::detach_from_endpoint helper — and through it, the Python unregister_model binding — doesn't have. Long-running workers leaked one entry per metadata file on every LoRA detach or model reload. This is the documented blocker on flipping DYN_SELF_HOST_METADATA default-on in the gh-8749 capstone.

Fix

Tag each registration with Owner = (connection_id, lora_slug) — the pair detach_from_endpoint already has on hand without an API change to the Python side. The registry keeps a secondary owners -> (slug, suffix) map populated by register. New unregister_for_owner(&owner) looks up that map and calls the existing unregister so there is no duplicated cleanup logic. The HTTP get read path is unchanged — no new locks on the hot path.

Details

lib/runtime/src/metadata_registry.rs: added pub type Owner = (u64, Option<String>), new owners map field, new unregister_for_owner method, register signature gains a leading Owner parameter (single caller in the workspace — no external breakage). Existing register / unregister / get semantics preserved.
lib/llm/src/local_model.rs: move_to_self_host now takes &Endpoint instead of &DistributedRuntime (one extra line to derive drt) and passes (drt.connection_id(), model_suffix.map(str::to_string)) as the owner. detach_from_endpoint calls unregister_for_owner before the discovery unregister — no-op when self-host was disabled or skipped. Deleted the never-called clear_self_hosted_artifacts method and the supporting attached_self_host_suffix: Option<String> field that only tracked one suffix per LocalModel anyway.

Test

Existing two metadata_registry tests updated to the new register signature (semantics unchanged).
New unregister_for_owner_clears_only_that_owner: registers one base + one LoRA from the same connection_id, detaches the LoRA via unregister_for_owner, asserts the LoRA entries are gone and base entries remain. Second call is a no-op (idempotent).
Full dynamo-llm + dynamo-runtime lib test suites, clippy with -D warnings, and cargo fmt --check all green.

Out of scope

The Python unregister_model API is unchanged — callers in components/src/dynamo/vllm/handlers.py do not need to pass a model name.
Default-on flip of DYN_SELF_HOST_METADATA — still gated on this PR plus a multimodal positive e2e; tracked in the DEP (light): Worker self-hosted metadata files #8749 capstone.

DEP (light): Worker self-hosted metadata files #8749 — self-hosted metadata stack
feat(self-host): native preprocessor consumes resolved slug_dir (gh-8749) #10037 — native preprocessor consumes resolved slug_dir (PR4a)
feat(self-host): worker hosts metadata files via system_status_server (gh-8749) #8855 (worker HTTP hosting), feat(self-host): frontend MDC verify-and-cache pipeline (gh-8749) #9057 (frontend verify-and-cache), fix(self-host): harvest non-weight siblings into slug_dir (gh-8749) #9610 (frontend HF-sibling harvest), feat(self-host): worker advertises non-typed metadata siblings (gh-8749) #9707 (worker extra_files) — earlier merges in the stack

Summary by CodeRabbit

Refactor
- Improved metadata registry for self-hosted model instances using enhanced owner-based tracking
- Streamlined artifact lifecycle management and cleanup for local model deployments

`MetadataArtifactRegistry` accumulated entries on every register but had no caller-reachable cleanup path: `clear_self_hosted_artifacts` needed a `LocalModel` handle, and the static `detach_from_endpoint` helper that the Python `unregister_model` binding calls doesn't have one. Long-running workers leaked entries on every LoRA detach or model reload — the documented blocker on flipping `DYN_SELF_HOST_METADATA` default-on. Fix: tag each registration with `Owner = (connection_id, lora_slug)`, the pair `detach_from_endpoint` already has. The registry keeps a secondary `owners -> (slug, suffix)` map; `unregister_for_owner` reads it and calls the existing `unregister` so there is no duplicated cleanup logic. HTTP `get` is unchanged — no new locks on the read path. The previous `attached_self_host_suffix` field and the never-called `clear_self_hosted_artifacts` method are deleted. Tests: existing 2 metadata_registry tests updated to the new `register` signature; new `unregister_for_owner_clears_only_that_owner` asserts LoRA detach leaves the base intact and is idempotent. Signed-off-by: nnshah1 <neelays@nvidia.com>

coderabbitai · 2026-06-05T06:46:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 68e0fdc6-ef9e-4e0c-8387-8497c22fe536

📥 Commits

Reviewing files that changed from the base of the PR and between 7d5ca77 and 7524b19.

📒 Files selected for processing (2)

lib/llm/src/local_model.rs
lib/runtime/src/metadata_registry.rs

Walkthrough

This PR refactors metadata artifact registration to use owner-scoped tracking. MetadataArtifactRegistry now associates artifacts with an explicit Owner tuple (connection_id, lora_slug), supports per-owner cleanup via unregister_for_owner, and LocalModel integrates the new API by computing the owner during self-host rewrite and using it to register and clean up metadata.

Changes

Owner-scoped metadata artifact registration

Layer / File(s)	Summary
Metadata registry owner-scoped contract and implementation `lib/runtime/src/metadata_registry.rs`	`MetadataArtifactRegistry` defines `Owner = (u64, Option<String>)`, adds internal `owners` index mapping owners to `(slug, suffix)` pairs, updates `register` signature to include owner parameter, adds `unregister_for_owner(&Owner)` method for owner-scoped cleanup, and tests are updated to register with owners and verify selective removal by owner.
LocalModel integration with owner-based metadata registration `lib/llm/src/local_model.rs`	`LocalModel` removes the `attached_self_host_suffix` field; `attach` passes the full `Endpoint` to `move_to_self_host`; `move_to_self_host` computes owner from `connection_id` and `model_suffix`, then calls `register(owner, ...)` with the owner parameter; `detach_from_endpoint` calls `unregister_for_owner(&(instance_id, model_suffix))` to clean up owner-scoped metadata entries.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: introducing automatic cleanup of metadata-registry entries on model detach, addressing the ownership tracking and unregistering mechanism described throughout the PR.
Description check	✅ Passed	The description comprehensively covers all template sections: Overview clearly states the problem and solution, Details enumerate specific file changes and their rationale, Test describes validation approach, and Related Issues are linked appropriately.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

dynamo-ops · 2026-06-05T06:51:01Z

-/// Cloning shares the underlying map.
+/// `(connection_id, lora_slug)` — identifies what `detach_from_endpoint`
+/// has on hand. `None` lora_slug = base model.
+pub type Owner = (u64, Option<String>);


Owner is only (connection_id, lora_slug), but one runtime can register multiple endpoints and repeated base/LoRA slugs, so later registrations overwrite the owner index and detaching one endpoint can leak its artifacts or unregister artifacts still used by another endpoint. Fix: make the owner include the endpoint identity and reference-count/shared-owner track (slug, suffix) entries so they are deleted only after the last owner detaches.

🤖 AI Fix

Update lib/runtime/src/metadata_registry.rs so Owner is a struct/key containing connection_id, namespace, component, endpoint, and lora_slug, store owner references per (slug, suffix) instead of a single Owner -> (slug, suffix) value, and make unregister_for_owner remove entries for a (slug, suffix) only when no remaining owner references it; update LocalModel::move_to_self_host and LocalModel::detach_from_endpoint in lib/llm/src/local_model.rs to build this owner from endpoint.id(), drt.connection_id(), and the slugified LoRA name.

dynamo-ops · 2026-06-05T06:51:01Z

+        // No-op when self-host wasn't enabled or when DYN_SYSTEM_PORT was
+        // unset; matches `move_to_self_host`'s `(connection_id, lora_slug)`
+        // owner key.
+        drt.metadata_artifacts()


detach_from_endpoint removes local self-hosted metadata before discovery.unregister succeeds, so any transient unregister failure leaves a still-advertised model card whose /v1/metadata/... URLs now 404. Fix: perform unregister_for_owner only after discovery.unregister(instance).await? returns successfully, or restore the registry entries on failure.

🤖 AI Fix

In lib/llm/src/local_model.rs, move the drt.metadata_artifacts().unregister_for_owner(...) call in LocalModel::detach_from_endpoint to immediately after discovery.unregister(instance).await?, preserving the owner key before model_suffix is moved into DiscoveryInstance.

Address graham-code-review feedback on PR #10351: - Drop the secondary `owners` map; store `Owner = (instance_id, lora_slug)` inline with each entry value. One lock, one source of truth, no nested-write-lock hazard, no two-map sync risk. - `register` takes `&Owner` (one clone inside, not per-file). - Panic on collision: re-registering the same (slug, suffix, filename) with a different owner is a programming error (two attaches of the same model+suffix in one process would let detach-#1 wipe files detach-#2 still needs). Same-owner re-register is fine and just updates the path. - Doc + local var naming aligned on `instance_id` to match `local_model.rs`'s existing usage (the value populates `DiscoveryInstance::Model.instance_id`). - Tests: collision panic + same-owner update path coverage. Signed-off-by: nnshah1 <neelays@nvidia.com>

nnshah1 requested a review from a team as a code owner June 5, 2026 06:43

pull-request-size Bot added the size/L label Jun 5, 2026

github-actions Bot added the feat label Jun 5, 2026

dynamo-ops reviewed Jun 5, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to GITLAB June 10, 2026 22:43 Inactive

nnshah1 enabled auto-merge (squash) June 10, 2026 22:46

copy-pr-bot Bot temporarily deployed to GITLAB June 10, 2026 23:20 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351

feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/gh-8749-auto-detach-cleanup

nnshah1 commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Uh oh!

dynamo-ops Jun 5, 2026

Uh oh!

dynamo-ops Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nnshah1 commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Fix

Details

Test

Out of scope

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Walkthrough

Changes

Uh oh!

dynamo-ops Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dynamo-ops Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nnshah1 commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading