feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351
feat(self-host): auto-clean metadata-registry on detach (gh-8749)#10351nnshah1 wants to merge 2 commits into
Conversation
`MetadataArtifactRegistry` accumulated entries on every register but had no caller-reachable cleanup path: `clear_self_hosted_artifacts` needed a `LocalModel` handle, and the static `detach_from_endpoint` helper that the Python `unregister_model` binding calls doesn't have one. Long-running workers leaked entries on every LoRA detach or model reload — the documented blocker on flipping `DYN_SELF_HOST_METADATA` default-on. Fix: tag each registration with `Owner = (connection_id, lora_slug)`, the pair `detach_from_endpoint` already has. The registry keeps a secondary `owners -> (slug, suffix)` map; `unregister_for_owner` reads it and calls the existing `unregister` so there is no duplicated cleanup logic. HTTP `get` is unchanged — no new locks on the read path. The previous `attached_self_host_suffix` field and the never-called `clear_self_hosted_artifacts` method are deleted. Tests: existing 2 metadata_registry tests updated to the new `register` signature; new `unregister_for_owner_clears_only_that_owner` asserts LoRA detach leaves the base intact and is idempotent. Signed-off-by: nnshah1 <neelays@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughThis PR refactors metadata artifact registration to use owner-scoped tracking. ChangesOwner-scoped metadata artifact registration
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| /// Cloning shares the underlying map. | ||
| /// `(connection_id, lora_slug)` — identifies what `detach_from_endpoint` | ||
| /// has on hand. `None` lora_slug = base model. | ||
| pub type Owner = (u64, Option<String>); |
There was a problem hiding this comment.
Owner is only (connection_id, lora_slug), but one runtime can register multiple endpoints and repeated base/LoRA slugs, so later registrations overwrite the owner index and detaching one endpoint can leak its artifacts or unregister artifacts still used by another endpoint. Fix: make the owner include the endpoint identity and reference-count/shared-owner track (slug, suffix) entries so they are deleted only after the last owner detaches.
🤖 AI Fix
Update lib/runtime/src/metadata_registry.rs so Owner is a struct/key containing connection_id, namespace, component, endpoint, and lora_slug, store owner references per (slug, suffix) instead of a single Owner -> (slug, suffix) value, and make unregister_for_owner remove entries for a (slug, suffix) only when no remaining owner references it; update LocalModel::move_to_self_host and LocalModel::detach_from_endpoint in lib/llm/src/local_model.rs to build this owner from endpoint.id(), drt.connection_id(), and the slugified LoRA name.
| // No-op when self-host wasn't enabled or when DYN_SYSTEM_PORT was | ||
| // unset; matches `move_to_self_host`'s `(connection_id, lora_slug)` | ||
| // owner key. | ||
| drt.metadata_artifacts() |
There was a problem hiding this comment.
detach_from_endpoint removes local self-hosted metadata before discovery.unregister succeeds, so any transient unregister failure leaves a still-advertised model card whose /v1/metadata/... URLs now 404. Fix: perform unregister_for_owner only after discovery.unregister(instance).await? returns successfully, or restore the registry entries on failure.
🤖 AI Fix
In lib/llm/src/local_model.rs, move the drt.metadata_artifacts().unregister_for_owner(...) call in LocalModel::detach_from_endpoint to immediately after discovery.unregister(instance).await?, preserving the owner key before model_suffix is moved into DiscoveryInstance.
Address graham-code-review feedback on PR #10351: - Drop the secondary `owners` map; store `Owner = (instance_id, lora_slug)` inline with each entry value. One lock, one source of truth, no nested-write-lock hazard, no two-map sync risk. - `register` takes `&Owner` (one clone inside, not per-file). - Panic on collision: re-registering the same (slug, suffix, filename) with a different owner is a programming error (two attaches of the same model+suffix in one process would let detach-#1 wipe files detach-#2 still needs). Same-owner re-register is fine and just updates the path. - Doc + local var naming aligned on `instance_id` to match `local_model.rs`'s existing usage (the value populates `DiscoveryInstance::Model.instance_id`). - Tests: collision panic + same-owner update path coverage. Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview
MetadataArtifactRegistrywas a write-only sink: everymove_to_self_hostadded entries viaregister, but nothing ever removed them on detach. The only cleanup method on the worker side (LocalModel::clear_self_hosted_artifacts) required aLocalModelhandle that the staticLocalModel::detach_from_endpointhelper — and through it, the Pythonunregister_modelbinding — doesn't have. Long-running workers leaked one entry per metadata file on every LoRA detach or model reload. This is the documented blocker on flippingDYN_SELF_HOST_METADATAdefault-on in the gh-8749 capstone.Fix
Tag each registration with
Owner = (connection_id, lora_slug)— the pairdetach_from_endpointalready has on hand without an API change to the Python side. The registry keeps a secondaryowners -> (slug, suffix)map populated byregister. Newunregister_for_owner(&owner)looks up that map and calls the existingunregisterso there is no duplicated cleanup logic. The HTTPgetread path is unchanged — no new locks on the hot path.Details
lib/runtime/src/metadata_registry.rs: addedpub type Owner = (u64, Option<String>), newownersmap field, newunregister_for_ownermethod,registersignature gains a leadingOwnerparameter (single caller in the workspace — no external breakage). Existingregister/unregister/getsemantics preserved.lib/llm/src/local_model.rs:move_to_self_hostnow takes&Endpointinstead of&DistributedRuntime(one extra line to derivedrt) and passes(drt.connection_id(), model_suffix.map(str::to_string))as the owner.detach_from_endpointcallsunregister_for_ownerbefore the discovery unregister — no-op when self-host was disabled or skipped. Deleted the never-calledclear_self_hosted_artifactsmethod and the supportingattached_self_host_suffix: Option<String>field that only tracked one suffix perLocalModelanyway.Test
metadata_registrytests updated to the newregistersignature (semantics unchanged).unregister_for_owner_clears_only_that_owner: registers one base + one LoRA from the same connection_id, detaches the LoRA viaunregister_for_owner, asserts the LoRA entries are gone and base entries remain. Second call is a no-op (idempotent).dynamo-llm+dynamo-runtimelib test suites, clippy with-D warnings, andcargo fmt --checkall green.Out of scope
unregister_modelAPI is unchanged — callers incomponents/src/dynamo/vllm/handlers.pydo not need to pass a model name.DYN_SELF_HOST_METADATA— still gated on this PR plus a multimodal positive e2e; tracked in the DEP (light): Worker self-hosted metadata files #8749 capstone.Related
slug_dir(PR4a)Summary by CodeRabbit