Fix Endpoint delete request processing during agent startup#40568
Fix Endpoint delete request processing during agent startup#40568joestringer merged 11 commits intocilium:mainfrom
Conversation
515a421 to
bcbf55a
Compare
|
/test |
bcbf55a to
11cb5c4
Compare
|
/test |
64abb6b to
285d8e3
Compare
|
Would it make sense to clarify the release note to make it more obvious to users what the change in behavior is? Below is a suggestion from my current understanding of the impact: |
Yeah, this sounds good. Fixed the release note with a minor update to attribute the issue to |
joestringer
left a comment
There was a problem hiding this comment.
I did a very rough initial review pass but I didn't think deeply about the solution in this review. There's a few comments below of things that stood out.
|
Lets document the exact timing here somewhere in a README.md, so we can make it more explicit. Here's how it's supposed to work, back in #23486:
To be consistent, we need to hold the delete queue lock while we transition from unavailable -> available, and the plugin needs to retry. So the ordering is something like
Likewise, the CNI plugin needs to:
|
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit simplifies the DeletionFallbackClient usage and handling of
endpoint deletion requests. With this patch the constructor now returns
the client object without performing any initialization.
The `EndpointDeleteMany` method is directly used by callers to request
endpoint deletion which is processed according to below flow:
```mermaid
flowchart TD
A[DeletionFallbackClient] -->|EndpointDeleteMany| B[Get Or Connect Cilium API]
B -->|Success| D[Request Endpoint Delete]
B -->|Failure| E{DeletionQueue Lock}
D -->|Success| OK{Return OK}
D -->|Failure| E
E -->|Acquired| F[Persist Deletion Request]
E -->|NotAcquired| L[AcquireLock]
L -->|Success| B
L -->|Failed| NotOK
F -->|Success| OK{Return OK}
F -->|Failed| NotOK{Return Error}
```
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit adds a readiness precheck to certain endpoint subsystem APIs like Put, Delete, Patch. This check ensures that endpoint APIs are not exposed to external components like CNI till all dependencies are ready. Currently this includes: 1. Endpoint Delete * Endpoint Delete APIs are only exposed once the state restore is complete. This ensures that all delete operations always see full state of endpoint manager so as to avoid any missed deletes. 2. Endpoint Put/Patch * Endpoint Update APIs are guarded by DeletionQueue fence. These apis are only exposed once the offline deletion queue is processed by the agent. Since the processing of endpoint Delete and Add operations is async, this check ensures that we don't delete an active endpoint if there is a delay in replaying the offline delete queue. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
…tion Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit changes how other components(daemon) waits on initial policy computation for an endpoint. Instead of directly relying on the `InitialEnvoyPolicyComputed` channel in an endpoint object, we now expose a method for existing callers to block on initial policy computation. `WaitForInitialPolicy` method on endpoint object waits on either initial policy to be computed or endpoint to be deleted. This makes sure that callers don't wait on policy computation indefinitely in cases where endpoint is deleted before initial regeneration is completed. This fixes an issue in cilium during endpoint restore where a restored endpoint is deleted before regeneration when processing the offline deletion queue. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This reverts commit 62c81af. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
1d6c7f0 to
5500460
Compare
|
/test |
|
Thanks again for all your work on this @fristonio 🙏 Merging now. |
This PR fixes an issue in the Endpoint API's delete request
processing during cilium-agent startup.
Currently, when cilium-agent boots up, the API server is started first
which allows CNI to interact with the endpoint subsystem. This exposes
endpoint APIs like
Put,Deleteto CNI plugin without waiting forinitial state restore to complete.
This leads to a stituation where CNI plugin can request Delete for an
endpoint that is not yet restored and exposed to EndpointManager. In
such cases cilium-agent will respond with
NotFounderror to CNI.Currently, the CNI plugin ignores all errors in the delete path and
continues to clean up corresponding resources for the endpoint(eg.Link),
reporting success back to kubelet. When eventually endpoint restore
finishes regeneration fails for these endpoints, because the underlying
resources(eg. Link) no longer exist.
This patch fixes the issue by coordinating endpoint restore handling
with API requests processing for endpoint subsystem. With this patch,
the API server responds with a special 503(Service Unavailble) http
error for requests when the state restore is in progress. CNI
appropriately handles this error code by persisting the delete request
on the disk before cleaning up the associated resources. Once the state
restore is complete, this persisted deletion requests are replayed and
all pending deletes are processed.
With this patch the end-to-end flow for interaction between CNI plugin
and agent during startup behaves as follows:
Fixes: #39370