fix(agent): unregister service endpoints on metal process delete#168
Conversation
There was a problem hiding this comment.
Thanks for the submission here! Just a few questions/suggestions.
Also, thanks for filing the issue (#167) with detailed repro steps before submitting the fix, that's really helpful! One thing to consider: the idempotent UnregisterEndpoint change would be pretty easy to cover with a test (call it on an already-deleted resource and verify no error).
Would you be up for adding that? No worries if not, we can follow up separately.
b571f59 to
acc9dd3
Compare
|
I initially added the pre-stop during tests but wasn't able to reproduce a scenario where that's needed. So I've simplified this and kept the post-stop @Defilan 👍 |
Defilan
left a comment
There was a problem hiding this comment.
Great fix, Matias -- this cleanly addresses the stale endpoint bug from #167 and the idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety.
One required change: deleteProcess currently early-returns when StopProcess fails, which skips UnregisterEndpoint. Since the process is already removed from the map at that point, a retry will never clean up the endpoints. Please continue to attempt endpoint cleanup even on StopProcess failure, aggregating both errors.
Also, as mentioned in the first review, a small test for the idempotent UnregisterEndpoint (calling it on already-deleted resources) would be a welcome addition to lock in the contract.
Nice work on the errors.Join / shutdownErrors rename -- clean improvements.
| a.logger.Infow("stopping inference service", "key", key) | ||
| namespace, name := parseKey(key) | ||
|
|
||
| if err := a.executor.StopProcess(process.PID); err != nil { |
There was a problem hiding this comment.
Bug: If StopProcess fails, this returns immediately and UnregisterEndpoint is never called. However, the process has already been removed from a.processes (line 218), so a retry via the watcher poll loop will see the key as non-existent and skip cleanup entirely -- leaving stale endpoints behind.
Please continue to attempt endpoint cleanup even when StopProcess fails, and aggregate both errors.
| @@ -162,7 +163,9 @@ func (r *ServiceRegistry) UnregisterEndpoint(ctx context.Context, namespace, nam | |||
| }, | |||
| } | |||
| if err := r.client.Delete(ctx, service); err != nil { | |||
There was a problem hiding this comment.
Suggestion: Consider adding a debug-level log line when the Service is already gone (NotFound). This helps with troubleshooting in cases where the agent runs cleanup multiple times.
Signed-off-by: Matías Insaurralde <matias@insaurral.de>
Signed-off-by: Matías Insaurralde <matias@insaurral.de>
…ocess Signed-off-by: Matías Insaurralde <matias@insaurral.de>
3b5f0a8 to
aac8b9d
Compare
…ency coverage Signed-off-by: Matías Insaurralde <matias@insaurral.de>
Defilan
left a comment
There was a problem hiding this comment.
This is a clean, well-tested fix for #167. The idempotent UnregisterEndpoint pattern is exactly right for controller/reconcile safety, and the error aggregation in deleteProcess ensures endpoint cleanup always runs even when StopProcess fails. Both new tests lock in the behavioral guarantees nicely. All prior review feedback has been addressed. Really appreciate the thorough work here, Matias — great contribution!
One optional follow-up for a future PR: the Shutdown method could also call UnregisterEndpoint for each process during graceful shutdown to avoid stale endpoints on agent restart.
What
UnregisterEndpointboth before and after process stop.UnregisterEndpointidempotent by ignoring KubernetesNotFounderrors for Service and Endpoints deletion.Why
Fixes #167
How
Updated
deleteProcessinpkg/agent/agent.goto run endpoint cleanup as a best-effort sequence:namespace/namefrom the internal process key.registry.UnregisterEndpoint(ctx, namespace, name)before stopping the process.UnregisterEndpointagain after stop to handle timing/race windows.Updated
UnregisterEndpointinpkg/agent/registry.goto be idempotent:apierrors.IsNotFound.apierrors.IsNotFound.Design choices:
Checklist
make testpasses locallymake lintpasses locallygit commit -s) per DCO