Skip to content

Fix Endpoint delete request processing during agent startup#40568

Merged
joestringer merged 11 commits intocilium:mainfrom
fristonio:pr/fristonio/fix/cni-delete
Aug 13, 2025
Merged

Fix Endpoint delete request processing during agent startup#40568
joestringer merged 11 commits intocilium:mainfrom
fristonio:pr/fristonio/fix/cni-delete

Conversation

@fristonio
Copy link
Copy Markdown
Member

@fristonio fristonio commented Jul 18, 2025

This PR fixes an issue in the Endpoint API's delete request
processing during cilium-agent startup.
Currently, when cilium-agent boots up, the API server is started first
which allows CNI to interact with the endpoint subsystem. This exposes
endpoint APIs like Put, Delete to CNI plugin without waiting for
initial state restore to complete.
This leads to a stituation where CNI plugin can request Delete for an
endpoint that is not yet restored and exposed to EndpointManager. In
such cases cilium-agent will respond with NotFound error to CNI.
Currently, the CNI plugin ignores all errors in the delete path and
continues to clean up corresponding resources for the endpoint(eg.Link),
reporting success back to kubelet. When eventually endpoint restore
finishes regeneration fails for these endpoints, because the underlying
resources(eg. Link) no longer exist.

This patch fixes the issue by coordinating endpoint restore handling
with API requests processing for endpoint subsystem. With this patch,
the API server responds with a special 503(Service Unavailble) http
error for requests when the state restore is in progress. CNI
appropriately handles this error code by persisting the delete request
on the disk before cleaning up the associated resources. Once the state
restore is complete, this persisted deletion requests are replayed and
all pending deletes are processed.

With this patch the end-to-end flow for interaction between CNI plugin
and agent during startup behaves as follows:

  • Agent is down:
    • Connection to API server fails. CNI plugin persists DELs on disk.
  • Agent is up; API server is started:
    • Server responds with 503 for endpoint API requests.
    • CNI processes and persists DELs on disk; ADDs are rejected.
  • Agent is up; API Server is up; Bootstrap in progress
    • Endpoint state is restored from the disk and exposed to EndpointManager.
  • Agent is up; API Server is up; Bootstrap completed:
    • API accepts DELs but still rejects ADDs
    • Deletion Queue handler processes persisted DELs from disk.
  • Agent is up; API Server is up; Bootstrap completed; DeletionQueue drained:
    • API is fully functional
  • Regeneration is triggered for all alive restored endpoints.

Fixes: #39370

Fix a bug where cilium-agent would report "Link not found" for an endpoint deleted during state restore after cilium-agent restart.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 18, 2025
@fristonio fristonio added kind/bug This is a bug in the Cilium logic. area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. release-note/bug This PR fixes an issue in a previous release of Cilium. area/agent Cilium agent related. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 18, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 18, 2025
@fristonio fristonio force-pushed the pr/fristonio/fix/cni-delete branch 2 times, most recently from 515a421 to bcbf55a Compare July 18, 2025 17:14
@fristonio
Copy link
Copy Markdown
Member Author

/test

@fristonio fristonio force-pushed the pr/fristonio/fix/cni-delete branch from bcbf55a to 11cb5c4 Compare July 21, 2025 07:29
@fristonio
Copy link
Copy Markdown
Member Author

/test

@fristonio fristonio marked this pull request as ready for review July 21, 2025 16:30
@fristonio fristonio requested review from a team as code owners July 21, 2025 16:30
@fristonio fristonio force-pushed the pr/fristonio/fix/cni-delete branch 2 times, most recently from 64abb6b to 285d8e3 Compare July 22, 2025 16:48
@joestringer
Copy link
Copy Markdown
Member

Would it make sense to clarify the release note to make it more obvious to users what the change in behavior is? Below is a suggestion from my current understanding of the impact:

Fix a bug where cilium-agent would report "Link not found" for a Pod that has been deleted during cilium-agent restart

@fristonio
Copy link
Copy Markdown
Member Author

Fix a bug where cilium-agent would report "Link not found" for a Pod that has been deleted during cilium-agent restart

Yeah, this sounds good. Fixed the release note with a minor update to attribute the issue to cilium-agent state restore after restart instead of just restart.

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a very rough initial review pass but I didn't think deeply about the solution in this review. There's a few comments below of things that stood out.

@squeed
Copy link
Copy Markdown
Contributor

squeed commented Jul 23, 2025

Lets document the exact timing here somewhere in a README.md, so we can make it more explicit.

Here's how it's supposed to work, back in #23486:

  1. Agent goes down
  2. CNI plugin fails to connect to gRPC socket. It takes a lockfile and writes a queued deletion
  3. Agent comes back up. It takes an exclusive lock on the lockfile and replays the deletion queue
  4. Agent starts the REST server
  5. Agent drops the lock
  6. Any blocked CNI plugin processes obtain the lock and retry the REST request, on the off chance the agent was starting up

To be consistent, we need to hold the delete queue lock while we transition from unavailable -> available, and the plugin needs to retry. So the ordering is something like

  1. The REST API starts up, but rejects all endpoint requests
  2. Agent takes the lock and replays all queued deletions
  3. Endpoint API transitions to available
  4. Agent drops the lock

Likewise, the CNI plugin needs to:

  1. Connect and try the Delete request. If that fails...
  2. Take the lock
  3. Re-try connecting and the delete request
  4. Fall back to queued delete.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit simplifies the DeletionFallbackClient usage and handling of
endpoint deletion requests. With this patch the constructor now returns
the client object without performing any initialization.
The `EndpointDeleteMany` method is directly used by callers to request
endpoint deletion which is processed according to below flow:

```mermaid
flowchart TD
    A[DeletionFallbackClient] -->|EndpointDeleteMany| B[Get Or Connect Cilium API]
    B -->|Success| D[Request Endpoint Delete]
    B -->|Failure| E{DeletionQueue Lock}
    D -->|Success| OK{Return OK}
    D -->|Failure| E
    E -->|Acquired| F[Persist Deletion Request]
    E -->|NotAcquired| L[AcquireLock]
    L -->|Success| B
    L -->|Failed| NotOK
    F -->|Success| OK{Return OK}
    F -->|Failed| NotOK{Return Error}
```

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit adds a readiness precheck to certain endpoint subsystem APIs
like Put, Delete, Patch. This check ensures that endpoint APIs are not
exposed to external components like CNI till all dependencies are ready.
Currently this includes:

1. Endpoint Delete
  * Endpoint Delete APIs are only exposed once the state restore is complete.
  This ensures that all delete operations always see full state of endpoint
  manager so as to avoid any missed deletes.

2. Endpoint Put/Patch
  * Endpoint Update APIs are guarded by DeletionQueue fence. These apis are
  only exposed once the offline deletion queue is processed by the agent.
  Since the processing of endpoint Delete and Add operations is async,
  this check ensures that we don't delete an active endpoint if there is a
  delay in replaying the offline delete queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
…tion

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This commit changes how other components(daemon) waits on initial policy
computation for an endpoint. Instead of directly relying on the
`InitialEnvoyPolicyComputed` channel in an endpoint object, we now expose
a method for existing callers to block on initial policy computation.
`WaitForInitialPolicy` method on endpoint object waits on either initial
policy to be computed or endpoint to be deleted. This makes sure that
callers don't wait on policy computation indefinitely in cases where
endpoint is deleted before initial regeneration is completed.

This fixes an issue in cilium during endpoint restore where a restored
endpoint is deleted before regeneration when processing the offline
deletion queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
This reverts commit 62c81af.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
@fristonio fristonio force-pushed the pr/fristonio/fix/cni-delete branch from 1d6c7f0 to 5500460 Compare August 12, 2025 16:51
@fristonio
Copy link
Copy Markdown
Member Author

fristonio commented Aug 12, 2025

/test
K8s NetworkPolicy E2E tests - #39968

@fristonio fristonio removed the request for review from squeed August 13, 2025 21:07
@joestringer
Copy link
Copy Markdown
Member

Thanks again for all your work on this @fristonio 🙏 Merging now.

@joestringer joestringer merged commit 73a403b into cilium:main Aug 13, 2025
67 of 70 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Aug 13, 2025
@joamaki joamaki mentioned this pull request Aug 19, 2025
19 tasks
@joamaki joamaki added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 19, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/agent Cilium agent related. area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. backport/author The backport will be carried out by the author of the PR. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

Endpoint restoration/regeneration of completed job stuck in infinite loop (error="retrieving device lxc*: Link not found")

10 participants