Skip to content

policy: introduce lazy ordered/sorted selections#43368

Closed
odinuge wants to merge 4 commits intocilium:mainfrom
odinuge:odinuge/selector-cache-perf-large-list-of-ids
Closed

policy: introduce lazy ordered/sorted selections#43368
odinuge wants to merge 4 commits intocilium:mainfrom
odinuge:odinuge/selector-cache-perf-large-list-of-ids

Conversation

@odinuge
Copy link
Copy Markdown
Member

@odinuge odinuge commented Dec 16, 2025

This introduces a separate sorted structure that can be used by the components that need it. Previously all selections were sorted. This very cosly in the hotpath of the selection updates, as sorting 10k+ slices is pretty expensive. This still creates a copy of the selections, but it does not sort it.

This can in theory result in higher memory usage since we now keep a set (thats in practice a map), and the sorted slice if asked for. In practice not many things need the sorted version, noteably the dnsproxy as well as some l7/xds configs. Its also not totally sure that those always need it either - and in the cases they need it, they also end up keeping it in memory - so the diff is pretty small in practice. Most of the datapath is also using updates, so its not relying on the sorted version all the time.

Its a bit hard to benchmark this directly without cherry-picking results, but a simple test-case where a single selector selects say 20k identities (eg. the 'cluster' entity), that are updated say 5 times/sec - results in a lot of cpu cycles spent on sorting that slice.

Fixes: #issue-number

n/a

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 16, 2025
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Dec 16, 2025
@odinuge odinuge force-pushed the odinuge/selector-cache-perf-large-list-of-ids branch from 9e9b35c to 3196140 Compare December 16, 2025 12:00
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 16, 2025

/test

@odinuge odinuge force-pushed the odinuge/selector-cache-perf-large-list-of-ids branch 2 times, most recently from 1637272 to 35f4c45 Compare December 16, 2025 12:39
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 16, 2025

/test

@odinuge odinuge changed the title policy: introduce lazy ordered selections policy: introduce lazy ordered/sorted selections Dec 16, 2025
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 16, 2025

Screenshot 2025-12-16 at 16 11 52 While testing a set of identities changing concurrently, while having a selector that selects the `cluster` entity we see that a lot of time is spent doing this sorting.

id types.SelectorId
users map[CachedSelectionUser]struct{}
cachedSelections map[identity.NumericIdentity]struct{}
cachedSelections set.Set[identity.NumericIdentity]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into doing part.Set[identity.NumericIdentity] here as well. That could be useful to reduce the number of allocations - but it will most likely do more allocations that the current version in main, but potentially be faster and allocate less memory.

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 17, 2025

Benchmarks using #43407;

pkg: github.com/cilium/cilium/pkg/policy
                                │ /tmp/original.yml │             /tmp/pr.yml              │
                                │      sec/op       │    sec/op     vs base                │
ResolveNoMatchingRules-12              1.848m ± 37%   2.126m ± 48%        ~ (p=0.052 n=10)
SelectorCacheIdentityUpdates-12      2037.77µ ±  9%   10.59µ ±  4%  -99.48% (p=0.000 n=10)
geomean                                1.941m         150.0µ        -92.27%

                                │ /tmp/original.yml │             /tmp/pr.yml              │
                                │       B/op        │     B/op      vs base                │
ResolveNoMatchingRules-12              157.3Ki ± 0%   157.3Ki ± 0%        ~ (p=0.584 n=10)
SelectorCacheIdentityUpdates-12       164.80Ki ± 0%   11.52Ki ± 0%  -93.01% (p=0.000 n=10)
geomean                                161.0Ki        42.57Ki       -73.56%

                                │ /tmp/original.yml │              /tmp/pr.yml               │
                                │     allocs/op     │  allocs/op   vs base                   │
ResolveNoMatchingRules-12               20.02k ± 0%   20.02k ± 0%         ~ (p=1.000 n=10) ¹
SelectorCacheIdentityUpdates-12          51.00 ± 2%   116.00 ± 0%  +127.45% (p=0.000 n=10)
geomean                                 1.011k        1.524k        +50.81%

So it seems like this is much faster at identity updates, but it does have a significantly higher allocation count; but even more significant reduction in cpu time and allocation size.

This introduces a separate sorted structure that can be used by the
components that need it. Previously all selections were sorted. This
very costly in the hotpath of the selection updates, as sorting 10k+
slices is pretty expensive. This still creates a copy of the selections,
but it does not sort it.

This can in theory result in higher memory usage since we now keep a set
(in practice a map), and the sorted slice if asked for. In
practice not many things need the sorted version, notably the dnsproxy
as well as some l7/xds configs. Its also not totally sure that those
always need it either - and in the cases they need it, they also end up
keeping it in memory - so the diff is pretty small in practice. Most
of the datapath is also using updates, so its not relying on the sorted
version all the time.

Its a bit hard to benchmark this directly without cherry-picking
results, but a simple test-case where a single selector selects say 20k
identities (eg. the 'cluster' entity), that are updated say 5 times/sec
- results in a lot of cpu cycles spent on sorting that slice.

Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@odinuge odinuge force-pushed the odinuge/selector-cache-perf-large-list-of-ids branch 2 times, most recently from 0ed7a43 to d39b47d Compare December 17, 2025 13:15
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 17, 2025

/test

This swaps the set.Set to a part.Set, so that we can avoid cloning the
whole datastructure on each write. This does result in more allocations
in general, but overall it looks like a win since we reduce the time it
takes to clone it. The same applies to the total amount of allocated
bytes, that seems to go down drastically.

Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@odinuge odinuge force-pushed the odinuge/selector-cache-perf-large-list-of-ids branch from d39b47d to 895b305 Compare December 17, 2025 13:54
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Dec 17, 2025

/test

@odinuge odinuge mentioned this pull request Dec 18, 2025
8 tasks
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Jan 7, 2026

Seems like we are having conflicts. I'll wait for #42580 to be worked through and merge before rebasing this. 😄

@joestringer joestringer added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jan 8, 2026
@aanm aanm removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jan 14, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Feb 14, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 1, 2026

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Mar 1, 2026
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Mar 6, 2026

I can't seem to reopen, so I'll open a new one when I get time. I've been busy mitigating other critical bugs like #44328 the past few weeks, and to get v1.17 working so this hasn't been prioritized. So I'm still planning to do this work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants