Skip to content

feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351

Open
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/gh-8749-auto-detach-cleanup
Open

feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/gh-8749-auto-detach-cleanup

Conversation

@nnshah1

@nnshah1 nnshah1 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Overview

MetadataArtifactRegistry was a write-only sink: every move_to_self_host added entries via register, but nothing ever removed them on detach. The only cleanup method on the worker side (LocalModel::clear_self_hosted_artifacts) required a LocalModel handle that the static LocalModel::detach_from_endpoint helper — and through it, the Python unregister_model binding — doesn't have. Long-running workers leaked one entry per metadata file on every LoRA detach or model reload. This is the documented blocker on flipping DYN_SELF_HOST_METADATA default-on in the gh-8749 capstone.

Fix

Tag each registration with Owner = (connection_id, lora_slug) — the pair detach_from_endpoint already has on hand without an API change to the Python side. The registry keeps a secondary owners -> (slug, suffix) map populated by register. New unregister_for_owner(&owner) looks up that map and calls the existing unregister so there is no duplicated cleanup logic. The HTTP get read path is unchanged — no new locks on the hot path.

Details

  • lib/runtime/src/metadata_registry.rs: added pub type Owner = (u64, Option<String>), new owners map field, new unregister_for_owner method, register signature gains a leading Owner parameter (single caller in the workspace — no external breakage). Existing register / unregister / get semantics preserved.
  • lib/llm/src/local_model.rs: move_to_self_host now takes &Endpoint instead of &DistributedRuntime (one extra line to derive drt) and passes (drt.connection_id(), model_suffix.map(str::to_string)) as the owner. detach_from_endpoint calls unregister_for_owner before the discovery unregister — no-op when self-host was disabled or skipped. Deleted the never-called clear_self_hosted_artifacts method and the supporting attached_self_host_suffix: Option<String> field that only tracked one suffix per LocalModel anyway.

Test

  • Existing two metadata_registry tests updated to the new register signature (semantics unchanged).
  • New unregister_for_owner_clears_only_that_owner: registers one base + one LoRA from the same connection_id, detaches the LoRA via unregister_for_owner, asserts the LoRA entries are gone and base entries remain. Second call is a no-op (idempotent).
  • Full dynamo-llm + dynamo-runtime lib test suites, clippy with -D warnings, and cargo fmt --check all green.

Out of scope

  • The Python unregister_model API is unchanged — callers in components/src/dynamo/vllm/handlers.py do not need to pass a model name.
  • Default-on flip of DYN_SELF_HOST_METADATA — still gated on this PR plus a multimodal positive e2e; tracked in the DEP (light): Worker self-hosted metadata files #8749 capstone.

Related

Summary by CodeRabbit

  • Refactor
    • Improved metadata registry for self-hosted model instances using enhanced owner-based tracking
    • Streamlined artifact lifecycle management and cleanup for local model deployments

`MetadataArtifactRegistry` accumulated entries on every register but
had no caller-reachable cleanup path: `clear_self_hosted_artifacts`
needed a `LocalModel` handle, and the static `detach_from_endpoint`
helper that the Python `unregister_model` binding calls doesn't have
one. Long-running workers leaked entries on every LoRA detach or
model reload — the documented blocker on flipping
`DYN_SELF_HOST_METADATA` default-on.

Fix: tag each registration with `Owner = (connection_id, lora_slug)`,
the pair `detach_from_endpoint` already has. The registry keeps a
secondary `owners -> (slug, suffix)` map; `unregister_for_owner` reads
it and calls the existing `unregister` so there is no duplicated
cleanup logic. HTTP `get` is unchanged — no new locks on the read
path. The previous `attached_self_host_suffix` field and the
never-called `clear_self_hosted_artifacts` method are deleted.

Tests: existing 2 metadata_registry tests updated to the new
`register` signature; new `unregister_for_owner_clears_only_that_owner`
asserts LoRA detach leaves the base intact and is idempotent.

Signed-off-by: nnshah1 <neelays@nvidia.com>
@nnshah1 nnshah1 requested a review from a team as a code owner June 5, 2026 06:43
@github-actions github-actions Bot added the feat label Jun 5, 2026
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 68e0fdc6-ef9e-4e0c-8387-8497c22fe536

📥 Commits

Reviewing files that changed from the base of the PR and between 7d5ca77 and 7524b19.

📒 Files selected for processing (2)
  • lib/llm/src/local_model.rs
  • lib/runtime/src/metadata_registry.rs

Walkthrough

This PR refactors metadata artifact registration to use owner-scoped tracking. MetadataArtifactRegistry now associates artifacts with an explicit Owner tuple (connection_id, lora_slug), supports per-owner cleanup via unregister_for_owner, and LocalModel integrates the new API by computing the owner during self-host rewrite and using it to register and clean up metadata.

Changes

Owner-scoped metadata artifact registration

Layer / File(s) Summary
Metadata registry owner-scoped contract and implementation
lib/runtime/src/metadata_registry.rs
MetadataArtifactRegistry defines Owner = (u64, Option<String>), adds internal owners index mapping owners to (slug, suffix) pairs, updates register signature to include owner parameter, adds unregister_for_owner(&Owner) method for owner-scoped cleanup, and tests are updated to register with owners and verify selective removal by owner.
LocalModel integration with owner-based metadata registration
lib/llm/src/local_model.rs
LocalModel removes the attached_self_host_suffix field; attach passes the full Endpoint to move_to_self_host; move_to_self_host computes owner from connection_id and model_suffix, then calls register(owner, ...) with the owner parameter; detach_from_endpoint calls unregister_for_owner(&(instance_id, model_suffix)) to clean up owner-scoped metadata entries.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: introducing automatic cleanup of metadata-registry entries on model detach, addressing the ownership tracking and unregistering mechanism described throughout the PR.
Description check ✅ Passed The description comprehensively covers all template sections: Overview clearly states the problem and solution, Details enumerate specific file changes and their rationale, Test describes validation approach, and Related Issues are linked appropriately.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

/// Cloning shares the underlying map.
/// `(connection_id, lora_slug)` — identifies what `detach_from_endpoint`
/// has on hand. `None` lora_slug = base model.
pub type Owner = (u64, Option<String>);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Owner is only (connection_id, lora_slug), but one runtime can register multiple endpoints and repeated base/LoRA slugs, so later registrations overwrite the owner index and detaching one endpoint can leak its artifacts or unregister artifacts still used by another endpoint. Fix: make the owner include the endpoint identity and reference-count/shared-owner track (slug, suffix) entries so they are deleted only after the last owner detaches.

🤖 AI Fix

Update lib/runtime/src/metadata_registry.rs so Owner is a struct/key containing connection_id, namespace, component, endpoint, and lora_slug, store owner references per (slug, suffix) instead of a single Owner -> (slug, suffix) value, and make unregister_for_owner remove entries for a (slug, suffix) only when no remaining owner references it; update LocalModel::move_to_self_host and LocalModel::detach_from_endpoint in lib/llm/src/local_model.rs to build this owner from endpoint.id(), drt.connection_id(), and the slugified LoRA name.

// No-op when self-host wasn't enabled or when DYN_SYSTEM_PORT was
// unset; matches `move_to_self_host`'s `(connection_id, lora_slug)`
// owner key.
drt.metadata_artifacts()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detach_from_endpoint removes local self-hosted metadata before discovery.unregister succeeds, so any transient unregister failure leaves a still-advertised model card whose /v1/metadata/... URLs now 404. Fix: perform unregister_for_owner only after discovery.unregister(instance).await? returns successfully, or restore the registry entries on failure.

🤖 AI Fix

In lib/llm/src/local_model.rs, move the drt.metadata_artifacts().unregister_for_owner(...) call in LocalModel::detach_from_endpoint to immediately after discovery.unregister(instance).await?, preserving the owner key before model_suffix is moved into DiscoveryInstance.

Address graham-code-review feedback on PR #10351:

- Drop the secondary `owners` map; store `Owner = (instance_id,
  lora_slug)` inline with each entry value. One lock, one source of
  truth, no nested-write-lock hazard, no two-map sync risk.
- `register` takes `&Owner` (one clone inside, not per-file).
- Panic on collision: re-registering the same (slug, suffix, filename)
  with a different owner is a programming error (two attaches of the
  same model+suffix in one process would let detach-#1 wipe files
  detach-#2 still needs). Same-owner re-register is fine and just
  updates the path.
- Doc + local var naming aligned on `instance_id` to match
  `local_model.rs`'s existing usage (the value populates
  `DiscoveryInstance::Model.instance_id`).
- Tests: collision panic + same-owner update path coverage.

Signed-off-by: nnshah1 <neelays@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants