Skip to content

fix(agent): unregister service endpoints on metal process delete#168

Merged
Defilan merged 4 commits intodefilantech:mainfrom
matiasinsaurralde:fix/metal-endpoint-cleanup-on-delete
Mar 4, 2026
Merged

fix(agent): unregister service endpoints on metal process delete#168
Defilan merged 4 commits intodefilantech:mainfrom
matiasinsaurralde:fix/metal-endpoint-cleanup-on-delete

Conversation

@matiasinsaurralde
Copy link
Contributor

What

  • Add best-effort endpoint cleanup in the Metal agent delete flow by calling UnregisterEndpoint both before and after process stop.
  • Make UnregisterEndpoint idempotent by ignoring Kubernetes NotFound errors for Service and Endpoints deletion.
  • Improve reliability of Metal service teardown to reduce stale networking artifacts (including lingering EndpointSlices).

Why

  • Deleting a Metal-backed inference service previously stopped the process but could leave Kubernetes networking objects behind.
  • Stale endpoint state causes confusing service discovery behavior and makes users think deletion did not fully work.
  • Idempotent cleanup is required for reconcile/retry safety, where delete paths may run multiple times.

Fixes #167

How

  • Updated deleteProcess in pkg/agent/agent.go to run endpoint cleanup as a best-effort sequence:

    1. Parse namespace/name from the internal process key.
    2. Call registry.UnregisterEndpoint(ctx, namespace, name) before stopping the process.
    3. Stop the local llama.cpp process.
    4. Call UnregisterEndpoint again after stop to handle timing/race windows.
    5. Aggregate errors from all steps so we surface failures without skipping later cleanup attempts.
  • Updated UnregisterEndpoint in pkg/agent/registry.go to be idempotent:

    • Service delete now ignores apierrors.IsNotFound.
    • Endpoints delete now ignores apierrors.IsNotFound.
    • Other API errors are still returned (RBAC, connectivity, etc.).
  • Design choices:

    • Pre-stop + post-stop cleanup was chosen over a single call to improve robustness against transient ordering/race issues in endpoint reconciliation.
    • Idempotent unregister ensures repeated delete/reconcile paths are safe and do not fail on already-removed resources.

Checklist

  • Tests added/updated - Note: Don't know if we have tests for this in Metal and/or how complex it would be to cover this.
  • make test passes locally
  • make lint passes locally
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated (if user-facing change)

Copy link
Member

@Defilan Defilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission here! Just a few questions/suggestions.

Also, thanks for filing the issue (#167) with detailed repro steps before submitting the fix, that's really helpful! One thing to consider: the idempotent UnregisterEndpoint change would be pretty easy to cover with a test (call it on an already-deleted resource and verify no error).

Would you be up for adding that? No worries if not, we can follow up separately.

@matiasinsaurralde matiasinsaurralde force-pushed the fix/metal-endpoint-cleanup-on-delete branch from b571f59 to acc9dd3 Compare March 2, 2026 00:55
@matiasinsaurralde
Copy link
Contributor Author

I initially added the pre-stop during tests but wasn't able to reproduce a scenario where that's needed. So I've simplified this and kept the post-stop @Defilan 👍

Copy link
Member

@Defilan Defilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix, Matias -- this cleanly addresses the stale endpoint bug from #167 and the idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety.

One required change: deleteProcess currently early-returns when StopProcess fails, which skips UnregisterEndpoint. Since the process is already removed from the map at that point, a retry will never clean up the endpoints. Please continue to attempt endpoint cleanup even on StopProcess failure, aggregating both errors.

Also, as mentioned in the first review, a small test for the idempotent UnregisterEndpoint (calling it on already-deleted resources) would be a welcome addition to lock in the contract.

Nice work on the errors.Join / shutdownErrors rename -- clean improvements.

a.logger.Infow("stopping inference service", "key", key)
namespace, name := parseKey(key)

if err := a.executor.StopProcess(process.PID); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: If StopProcess fails, this returns immediately and UnregisterEndpoint is never called. However, the process has already been removed from a.processes (line 218), so a retry via the watcher poll loop will see the key as non-existent and skip cleanup entirely -- leaving stale endpoints behind.

Please continue to attempt endpoint cleanup even when StopProcess fails, and aggregate both errors.

@@ -162,7 +163,9 @@ func (r *ServiceRegistry) UnregisterEndpoint(ctx context.Context, namespace, nam
},
}
if err := r.client.Delete(ctx, service); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Consider adding a debug-level log line when the Service is already gone (NotFound). This helps with troubleshooting in cases where the agent runs cleanup multiple times.

Signed-off-by: Matías Insaurralde <matias@insaurral.de>
Signed-off-by: Matías Insaurralde <matias@insaurral.de>
…ocess

Signed-off-by: Matías Insaurralde <matias@insaurral.de>
@matiasinsaurralde matiasinsaurralde force-pushed the fix/metal-endpoint-cleanup-on-delete branch from 3b5f0a8 to aac8b9d Compare March 2, 2026 21:55
…ency coverage

Signed-off-by: Matías Insaurralde <matias@insaurral.de>
Copy link
Member

@Defilan Defilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clean, well-tested fix for #167. The idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety, and the error aggregation in deleteProcess ensures endpoint cleanup always runs even when StopProcess fails. Both new tests lock in the behavioral guarantees nicely. All prior review feedback has been addressed. Really appreciate the thorough work here, Matias — great contribution!

One optional follow-up for a future PR: the Shutdown method could also call UnregisterEndpoint for each process during graceful shutdown to avoid stale endpoints on agent restart.

@Defilan Defilan merged commit 147b9bc into defilantech:main Mar 4, 2026
15 checks passed
This was referenced Mar 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Endpoints aren't cleared after service deletion

2 participants