CES: add option to create CES directly from pods by jshr-w · Pull Request #38388 · cilium/cilium

jshr-w · 2025-03-21T03:13:02Z

This PR makes a change to add an option to create CiliumEndpointSlices directly from pods instead of from CiliumEndpoints. This is a step towards eliminating the need for CEPs when CES are present, which will allow us to reduce pressure created on the kube-apiserver by the CEP CRD at high scale.

Descriptions of changes are in each commit.

Add option to create CiliumEndpointSlices directly from pods instead of CiliumEndpoints

jshr-w · 2025-03-25T03:15:07Z

/test

jshr-w · 2025-03-25T05:38:30Z

cc @dlapcevic @ovidiutirla

dlapcevic · 2025-03-25T11:08:42Z

cc @sypakine

qmonnet

Looks OK for doc change, thanks!

giorio94

I took a first look at the PR, and left a couple of initial comments inline. I still need to fully digest all the logic (which is fairly new to me) though, as well as look at the tests.

This PR doesn't allow to disable the creation of CEPs yet (due to [1]), but if that was the case (which I assume being a follow-up step), CES would become incompatible with the clustermesh-apiserver. This component currently only relies on CEP information (which is fine being only a handful of replicas at most). Essentially, this logic [2] should be extended to work with CES as well, starting either one or the other based on the Cilium configuration. IMO that should be addressed before moving further, to avoid introducing yet another incompatibility between features.

FYI, one of the integration tests failures looks legitimate.

[1]:

cilium/pkg/option/config.go

Lines 3242 to 3248 in 81666d2

    
           // Ensure CiliumEndpointSlice is enabled only if CiliumEndpointCRD is enabled too. 
        
           c.EnableCiliumEndpointSlice = vp.GetBool(EnableCiliumEndpointSlice) 
        
           if c.EnableCiliumEndpointSlice && c.DisableCiliumEndpointCRD { 
        
           	log.Fatalf("Running Cilium with %s=%t requires %s set to false to enable CiliumEndpoint CRDs.", 
        
           		EnableCiliumEndpointSlice, c.EnableCiliumEndpointSlice, DisableCiliumEndpointCRDName) 
        
           }

[2]:

cilium/clustermesh-apiserver/clustermesh/root.go

Line 392 in 81666d2

    
           go synchronize(ctx, resources.CiliumSlimEndpoints, newEndpointSynchronizer(ctx, logger, cinfo, backend, factory, syncState.WaitForResource()))

operator/k8s/resources.go

operator/pkg/ciliumendpointslice/reconciler.go

jshr-w · 2025-03-26T02:14:11Z

Thanks for taking a look @giorio94! I will open a PR to ensure compatibility between the clustermesh apiserver and CES is retained with these changes.

pippolo84

Thanks for the PR! 💯

Spotted a couple of leftovers and left a comment about a minor nit inline, but overall LGTM.

operator/pkg/ciliumendpointslice/log.log

operator/pkg/ciliumendpointslice/controller_test.go

operator/pkg/ciliumidentity/testutils/test_utils.go

operator/pkg/ciliumidentity/reconciler.go

operator/pkg/ciliumendpointslice/reconciler.go

Export WireGuard --enable-wireguard flag for use in Operator. CES controller requires --enable-wireguard to set the encryption of the CES. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Add config option for `ces-controller-mode`. CES Controller can be allowed to run with 'default' mode, as it functions today, or the new mode 'slim', where CES completely replace CEPs in the cluster and CEPs are no longer created. When in default mode, the DefaultController is used, which watches CEPs and creates CES based off those changes. When in slim mode, the new SlimController will be used instead. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

The CES Manager must behave differently in 'default' and 'slim' mode, as we are required to store different state. The default mode will use the existing ceptocesmap as the datastore, while we introduce a new CESCache as the datastore for slim mode since we will need to correlate all the information when we do not rely on CEPs. We separate out the managers for these modes into a DefaultManager and SlimManager, which implement the Manager interface. The reconciler depends on the Manager interface. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

CES Controller in slim mode will watch CiliumNodes to process updates, particularly to the EncryptionKey that is stored in CESs. This requires watching CiliumNodes to handle update and delete. We will track the set of CEPs on a node, so we know which CESs need to be reconciled on a node update. Adds unit tests for node state. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

The Slim CES Controller process CiliumNode updates with an extra handler to pass updates to CiliumNode encryption keys to the local CES cache, so that the affected CESs can be reconciled on changes. Adds unit tests for CES manager node upsert/delete. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

We extend the CESCache to handle updates to CIDs. We will need to associate pods with their identities for creating CES. The SecIds data structure contains a selected ID, which is used if there are multiple ids associated with a label to minimize churn. Adds unit test for cid state. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Add flow where controller will process CID updates, update the relevant mappings in the manager and local CES cache. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Allow the local CES datastore to store CES and namespace information. We also introduce the CESCacher interface, to allow the default and slim CES managers to handle their own local cache types. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Add store for pod state to CESCache, along with necessary helper functions. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

We split the reconciler into the default and slim, to pass the appropriate params depending on the configuration initialized. Reconciler requires two new interfaces to handle the default/slim modes, on reconciliation of CES. The slim reconciler will reconcile based on pod state and local CES cache state instead of getting data from the CEP. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

jshr-w · 2025-12-26T22:24:38Z

Leaving a note; I've followed up with most of the changes requested here - there are a few I will kick to a follow-up PR such as adding more tests. I've noted them in the comments here & added them to the associated issue to ensure tracking for their follow-up. This feature is currently behind a hidden flag and partially implemented at this point.

Thanks @giorio94 for your patience and help through the entire review process for this super long piece :)

When a pod is added/updated or deleted, we insert it into the cache. We only attempt to add it to the CES cache if we have sufficient state, otherwise we wait for the necessary info to be ready on the pod. We will add the pod to the cache whether or not it has been assigned an identity -- in the case it has not, we will receive an update when the identity is created and associate it with the pod later. Add unit tests for cescache and for slim manager flow. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

On initialization of the operator and CES controller, we sync CES from the store to our local cache before the controller runs. To do this, we read the existing store state and create the existing CES in our local cache. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

Add start flow for SlimController to run. Modify test helpers for slim mode to be tested. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

jshr-w · 2025-12-30T02:36:52Z

/test

jshr-w · 2026-01-06T17:07:28Z

/ci-eks

giorio94 · 2026-01-07T16:46:19Z

@cilium/operator Gentle ping 🙏

I've also marked this PR as blocked for the moment, due to feature freeze.

jshr-w · 2026-01-20T18:22:15Z

Hello, gentle ping again for a review from @cilium/operator

pippolo84

✅

joestringer · 2026-01-26T17:44:50Z

Shreya pointed out that this was previously reviewed by @pippolo84 for the @cilium/operator group, but he has stepped down from the group so GitHub no longer recognizes that - hence it was stuck waiting for operator review despite the interface not explicitly showing this in the top right corner. I re-requested review from the group to nudge it forward.

joestringer · 2026-01-26T18:16:02Z

I'm seeing that @giorio94 reviewed the PR overall in detail. I'm glad he did, though it does feel like we should have pushed back more for incremental changes. >1K LoC is generally not an acceptable size for a change. I can tell from the comments in Marco's review that he feels obligated not to block the PR due to the sheer size and effort you've put into the PR. The functionality is useful. We definitely appreciate the work. On the other hand, this is how we end up with code we're not happy with in the repo because no-one wants to iterate further on the PR. How do we track and resolve the concerns about the operator getting into a state where it can't recover and properly sync back? Should we have an issue for it, and can we concretely describe what it would look like to resolve those concerns?

jshr-w · 2026-01-26T22:00:31Z

Hi Joe, that's fair to say. My understanding based on previous comments was that it would be okay to break this into smaller commits, but I would have also been willing to (and we can also do this now) break this into smaller PRs if requested.

It is not showing as linked, but I have an issue #38507 where I am tracking the other changes needed for this -- I will flesh some of these items out further.

joestringer · 2026-01-28T01:08:56Z

As long as we're tracking those concerns somehow and reviewers are otherwise happy that this change improves overall code health, we should probably move forward to merge this. Making progress is a healthy step towards improving the code overall (taking inspiration from Google's code review guidelines). @giorio94 does that sound good to you, merge then iterate in the tree?

giorio94 · 2026-01-28T17:23:11Z

Thanks Joe (and obviously @jshr-w for all the work you put here), I'll then go ahead and merge this PR. I think it sets reasonable foundations for the feature (which is currently gated behind a hidden flag), and it seems easier to me to keep iterating afterwards. My suggestion in that respect would be prioritize the scaffolding of the script tests infrastructure and the introduction of a few initial script tests. That should be of great help to figure out some of the possible rough edges, increase confidence, as well as allow to incrementally validate future improvements and fixes.

joestringer · 2026-01-28T18:05:57Z

Thanks for your guidance on this @giorio94 . And congrats @jshr-w for landing this useful feature 🎉

jshr-w · 2026-01-28T19:40:46Z

Thanks Marco again for all your guidance through this PR! I will make sure the updates to testing are handled as the first priority to follow this PR. And thanks Joe for your help :)

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 21, 2025

jshr-w force-pushed the jshr/podtoces branch 5 times, most recently from 3bc44bf to 42456bb Compare March 25, 2025 02:45

jshr-w changed the title ~~[WIP] ces: create ces from pods~~ CES: create CES directly from pods Mar 25, 2025

jshr-w force-pushed the jshr/podtoces branch from 42456bb to c935bfb Compare March 25, 2025 03:14

jshr-w marked this pull request as ready for review March 25, 2025 03:15

jshr-w requested review from a team as code owners March 25, 2025 03:15

jshr-w requested review from giorio94, pippolo84 and qmonnet March 25, 2025 03:15

qmonnet approved these changes Mar 25, 2025

View reviewed changes

qmonnet added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. feature/ces Impacts the Cilium Endpoint Slice logic. labels Mar 25, 2025

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 25, 2025

giorio94 requested changes Mar 25, 2025

View reviewed changes

operator/k8s/resources.go Outdated Show resolved Hide resolved

operator/k8s/resources.go Outdated Show resolved Hide resolved

operator/pkg/ciliumendpointslice/reconciler.go Outdated Show resolved Hide resolved

operator/pkg/ciliumendpointslice/reconciler.go Outdated Show resolved Hide resolved

jshr-w force-pushed the jshr/podtoces branch from c935bfb to 4954ab9 Compare March 25, 2025 20:12

This was referenced Mar 26, 2025

Remove CEP CRD when CES enabled #38507

Open

clustermesh-apiserver: enable parsing of ciliumendpointslices #38506

Merged

pippolo84 reviewed Apr 8, 2025

View reviewed changes

operator/pkg/ciliumendpointslice/log.log Outdated Show resolved Hide resolved

operator/pkg/ciliumendpointslice/controller_test.go Outdated Show resolved Hide resolved

operator/pkg/ciliumidentity/testutils/test_utils.go Outdated Show resolved Hide resolved

sypakine reviewed Apr 24, 2025

View reviewed changes

operator/pkg/ciliumidentity/reconciler.go Show resolved Hide resolved

operator/pkg/ciliumendpointslice/reconciler.go Outdated Show resolved Hide resolved

operator/pkg/ciliumendpointslice/reconciler.go Outdated Show resolved Hide resolved

jshr-w force-pushed the jshr/podtoces branch from 680a887 to e6d7890 Compare April 25, 2025 01:09

jshr-w added 12 commits December 26, 2025 22:19

wireguard: export config for enable-wireguard

467845e

Export WireGuard --enable-wireguard flag for use in Operator. CES controller requires --enable-wireguard to set the encryption of the CES. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

ces: cid update processing flow

84e2d3a

Add flow where controller will process CID updates, update the relevant mappings in the manager and local CES cache. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

ces: add cescache ces and ns handling

6a6b0c2

Allow the local CES datastore to store CES and namespace information. We also introduce the CESCacher interface, to allow the default and slim CES managers to handle their own local cache types. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

ces: handle namespace updates

4b7aeb8

Signed-off-by: jshr-w <shjayaraman@microsoft.com>

ces: add unit tests for ces update flow

59c500e

Signed-off-by: jshr-w <shjayaraman@microsoft.com>

ces: add pod state to CESCache

74741e3

Add store for pod state to CESCache, along with necessary helper functions. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

jshr-w added 3 commits December 26, 2025 22:46

ces: slimcontroller start flow

7eac204

Add start flow for SlimController to run. Modify test helpers for slim mode to be tested. Signed-off-by: jshr-w <shjayaraman@microsoft.com>

pippolo84 approved these changes Jan 26, 2026

View reviewed changes

joamaki approved these changes Jan 27, 2026

View reviewed changes

smagnani96 mentioned this pull request Mar 16, 2026

v1.19 Backports 2026-03-16 #44828

Merged

10 tasks

	// Ensure CiliumEndpointSlice is enabled only if CiliumEndpointCRD is enabled too.
	c.EnableCiliumEndpointSlice = vp.GetBool(EnableCiliumEndpointSlice)
	if c.EnableCiliumEndpointSlice && c.DisableCiliumEndpointCRD {
	log.Fatalf("Running Cilium with %s=%t requires %s set to false to enable CiliumEndpoint CRDs.",
	EnableCiliumEndpointSlice, c.EnableCiliumEndpointSlice, DisableCiliumEndpointCRDName)
	}

Conversation

jshr-w commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jshr-w commented Mar 25, 2025

Uh oh!

jshr-w commented Mar 25, 2025

Uh oh!

dlapcevic commented Mar 25, 2025

Uh oh!

qmonnet left a comment

Choose a reason for hiding this comment

Uh oh!

giorio94 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jshr-w commented Mar 26, 2025

Uh oh!

pippolo84 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jshr-w commented Dec 26, 2025

Uh oh!

jshr-w commented Dec 30, 2025

Uh oh!

jshr-w commented Jan 6, 2026

Uh oh!

giorio94 commented Jan 7, 2026

Uh oh!

jshr-w commented Jan 20, 2026

Uh oh!

pippolo84 left a comment

Choose a reason for hiding this comment

Uh oh!

joestringer commented Jan 26, 2026

Uh oh!

joestringer commented Jan 26, 2026

Uh oh!

jshr-w commented Jan 26, 2026

Uh oh!

joestringer commented Jan 28, 2026

Uh oh!

giorio94 commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joestringer commented Jan 28, 2026

Uh oh!

jshr-w commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

jshr-w commented Mar 21, 2025 •

edited

Loading

giorio94 commented Jan 28, 2026 •

edited

Loading