fix(agent): unregister service endpoints on metal process delete by matiasinsaurralde · Pull Request #168 · defilantech/LLMKube

matiasinsaurralde · 2026-02-22T02:30:00Z

What

Add best-effort endpoint cleanup in the Metal agent delete flow by calling UnregisterEndpoint both before and after process stop.
Make UnregisterEndpoint idempotent by ignoring Kubernetes NotFound errors for Service and Endpoints deletion.
Improve reliability of Metal service teardown to reduce stale networking artifacts (including lingering EndpointSlices).

Why

Deleting a Metal-backed inference service previously stopped the process but could leave Kubernetes networking objects behind.
Stale endpoint state causes confusing service discovery behavior and makes users think deletion did not fully work.
Idempotent cleanup is required for reconcile/retry safety, where delete paths may run multiple times.

Fixes #167

How

Updated deleteProcess in pkg/agent/agent.go to run endpoint cleanup as a best-effort sequence:
1. Parse namespace/name from the internal process key.
2. Call registry.UnregisterEndpoint(ctx, namespace, name) before stopping the process.
3. Stop the local llama.cpp process.
4. Call UnregisterEndpoint again after stop to handle timing/race windows.
5. Aggregate errors from all steps so we surface failures without skipping later cleanup attempts.
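Step 1 depends on splitting the process key into its two parts. The diff shows a call to parseKey(key) but not its body, so the following is only a minimal sketch. It assumes keys have the form "namespace/name", and the fallback for a key without a slash is a guess for illustration, not the project's behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKey splits a process key into namespace and name.
// Assumption: keys look like "namespace/name". The real helper in
// pkg/agent may behave differently. A key without a slash is treated
// here as a bare name with an empty namespace.
func parseKey(key string) (namespace, name string) {
	ns, n, found := strings.Cut(key, "/")
	if !found {
		return "", key
	}
	return ns, n
}

func main() {
	namespace, name := parseKey("default/llama")
	fmt.Println(namespace, name)
}
```

strings.Cut splits on the first slash only, so any further slashes would stay in the name part.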
Updated UnregisterEndpoint in pkg/agent/registry.go to be idempotent:
- Service delete now ignores apierrors.IsNotFound.
- Endpoints delete now ignores apierrors.IsNotFound.
- Other API errors are still returned (RBAC, connectivity, etc.).
Design choices:
- Pre-stop + post-stop cleanup was chosen over a single call to improve robustness against transient ordering/race issues in endpoint reconciliation.
- Idempotent unregister ensures repeated delete/reconcile paths are safe and do not fail on already-removed resources.

Checklist

Tests added/updated (note: I'm not sure whether we have tests for this in Metal, or how complex it would be to cover).
make test passes locally
make lint passes locally
Commit messages follow conventional commits
All commits are signed off (git commit -s) per DCO
Documentation updated (if user-facing change)

Defilan

Thanks for the submission here! Just a few questions/suggestions.

Also, thanks for filing the issue (#167) with detailed repro steps before submitting the fix, that's really helpful! One thing to consider: the idempotent UnregisterEndpoint change would be pretty easy to cover with a test (call it on an already-deleted resource and verify no error).

Would you be up for adding that? No worries if not, we can follow up separately.

pkg/agent/agent.go

matiasinsaurralde · 2026-03-02T01:41:51Z

I initially added the pre-stop during tests but wasn't able to reproduce a scenario where that's needed. So I've simplified this and kept the post-stop @Defilan 👍

Defilan

Great fix, Matias -- this cleanly addresses the stale endpoint bug from #167 and the idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety.

One required change: deleteProcess currently early-returns when StopProcess fails, which skips UnregisterEndpoint. Since the process is already removed from the map at that point, a retry will never clean up the endpoints. Please continue to attempt endpoint cleanup even on StopProcess failure, aggregating both errors.

Also, as mentioned in the first review, a small test for the idempotent UnregisterEndpoint (calling it on already-deleted resources) would be a welcome addition to lock in the contract.

Nice work on the errors.Join / shutdownErrors rename -- clean improvements.

Defilan · 2026-03-02T03:36:05Z

pkg/agent/agent.go

 	a.logger.Infow("stopping inference service", "key", key)
+	namespace, name := parseKey(key)

 	if err := a.executor.StopProcess(process.PID); err != nil {


Bug: If StopProcess fails, this returns immediately and UnregisterEndpoint is never called. However, the process has already been removed from a.processes (line 218), so a retry via the watcher poll loop will see the key as non-existent and skip cleanup entirely -- leaving stale endpoints behind.

Please continue to attempt endpoint cleanup even when StopProcess fails, and aggregate both errors.

Defilan · 2026-03-02T03:36:05Z

pkg/agent/registry.go

@@ -162,7 +163,9 @@ func (r *ServiceRegistry) UnregisterEndpoint(ctx context.Context, namespace, nam
 		},
 	}
 	if err := r.client.Delete(ctx, service); err != nil {


Suggestion: Consider adding a debug-level log line when the Service is already gone (NotFound). This helps with troubleshooting in cases where the agent runs cleanup multiple times.

Defilan

This is a clean, well-tested fix for #167. The idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety, and the error aggregation in deleteProcess ensures endpoint cleanup always runs even when StopProcess fails. Both new tests lock in the behavioral guarantees nicely. All prior review feedback has been addressed. Really appreciate the thorough work here, Matias — great contribution!

One optional follow-up for a future PR: the Shutdown method could also call UnregisterEndpoint for each process during graceful shutdown to avoid stale endpoints on agent restart.

Defilan requested changes Feb 22, 2026

matiasinsaurralde force-pushed the fix/metal-endpoint-cleanup-on-delete branch from b571f59 to acc9dd3 on March 2, 2026 00:55

matiasinsaurralde requested a review from Defilan March 2, 2026 01:41

Defilan requested changes Mar 2, 2026

matiasinsaurralde added 3 commits March 2, 2026 18:55

fix(agent): unregister service endpoints on metal process delete

c6a7d40

Signed-off-by: Matías Insaurralde <matias@insaurral.de>

fix(agent): join aggregated cleanup errors for better wrapping

864d27c

Signed-off-by: Matías Insaurralde <matias@insaurral.de>

fix(agent): remove redundant pre-stop endpoint unregister in deleteProcess

aac8b9d

Signed-off-by: Matías Insaurralde <matias@insaurral.de>

matiasinsaurralde force-pushed the fix/metal-endpoint-cleanup-on-delete branch from 3b5f0a8 to aac8b9d on March 2, 2026 21:55

fix(agent): continue endpoint cleanup on stop failure and add idempotency coverage

ffc8e3d

Signed-off-by: Matías Insaurralde <matias@insaurral.de>

matiasinsaurralde requested a review from Defilan March 2, 2026 22:30

Defilan approved these changes Mar 4, 2026

Defilan merged commit 147b9bc into defilantech:main Mar 4, 2026
15 checks passed

This was referenced Mar 2, 2026

chore: release 0.5.0 #191

Merged

chore: release 0.4.22 #207

Closed

Defilan merged 4 commits into defilantech:main from matiasinsaurralde:fix/metal-endpoint-cleanup-on-delete
