Skip to content

optimization: allow for lazy sidecar initialization#47221

Merged
istio-testing merged 12 commits intoistio:masterfrom
sschepens:lazy-sidecars
Jun 6, 2025
Merged

optimization: allow for lazy sidecar initialization#47221
istio-testing merged 12 commits intoistio:masterfrom
sschepens:lazy-sidecars

Conversation

@sschepens
Copy link
Copy Markdown
Contributor

@sschepens sschepens commented Oct 6, 2023

Please provide a description of this PR:
I know this PR will probably generate a lot of controversy, please read on.

In large deployments, most of the Istio's work is done in initSidecarScopes to scope configurations that are only visible to each Sidecar and it has a couple of problems:

  • there's no parallelization of work
  • there's a lot of potentially unnecessary work going on: Istio has to generate SidecarScopes for every Sidecar in the mesh while there could be Sidecars with no Pods or no Pods connected to a single Istio instance.

My proposal is to allow SidecarScopes to be lazy initialized, that means that Sidecar Scopes still are created but empty of indexes, whenever a Proxy selects a SidecarScope it is initialized (using sync.OnceFunc which guarantees that it's only done once). What this allows is for SidecarScopes initialization to be almost zero unless the SidecarScope is actually used and also allow SidecarScopes computation to be parallelized in the goroutines that end up generating the XDS config, this may sound strange, but it's actually already ocurring when Sidecar Scopes have to be derived.

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 6, 2023
@istio-testing
Copy link
Copy Markdown
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 6, 2023
@sschepens
Copy link
Copy Markdown
Contributor Author

@howardjohn could you please have a look at this?

@sschepens
Copy link
Copy Markdown
Contributor Author

/test all

@sschepens sschepens marked this pull request as ready for review October 6, 2023 20:45
@sschepens sschepens requested review from a team as code owners October 6, 2023 20:45
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 6, 2023
@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 10, 2023
@howardjohn howardjohn added the do-not-merge/hold Block automatic merging of a PR. label Oct 10, 2023
if !features.EnableLazySidecarEvaluation {
initSidecarScopeInternalIndexes(ps, out, configNamespace)
} else {
out.initFunc = sync.OnceFunc(func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just call so that we do not have to call initFunc seperately in tests and what we execute is same when EnableLazySidecarEvaluation is enabled/disabled?

if !features.EnableLazySidecarEvaluation {
	out.initFunc()
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that I was just trying to avoid adding any overhead when flag is disabled. Regarding tests, I still need to call initFunc in some tests because I need to make sure SidecarScopes are initialized to be able to compare them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramaraochavali I pushed some changes, let me know if they seem ok

@ramaraochavali
Copy link
Copy Markdown
Contributor

How many sidecars do you have and what is the kind of benefit you have seen with this change?

@sschepens
Copy link
Copy Markdown
Contributor Author

sschepens commented Oct 11, 2023

How many sidecars do you have and what is the kind of benefit you have seen with this change?

We currently have 2.5k sidecars but will eventually have much more. We currently have debounce times of about 30s which is waay too much and almost all of this 30s is spent rebuilding sidecar indexes.

We still haven't been able to test this change in production to see the benefit, will try to deploy this week and see if I can get some measurements.

@howardjohn
Copy link
Copy Markdown
Member

At the very least if we are going to make some large changes like this, we ought to have performance numbers indicating the changes.

@sschepens
Copy link
Copy Markdown
Contributor Author

At the very least if we are going to make some large changes like this, we ought to have performance numbers indicating the changes.

totally agree, i'll get some numbers in the coming days, I wanted to open the PR in the meantime to get your feelings about this.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Oct 12, 2023
@sschepens
Copy link
Copy Markdown
Contributor Author

sschepens commented Oct 18, 2023

Following up on this, today I managed to deploy this, here's the impact:
Screenshot 2023-10-18 at 16 02 43

Debounce times were fixed at 30s because that's the current histogram limit in istiod, that basically means they were taking at least 30s. They are now taking a maximum of 2.5s. This makes sense because processing has been pushed to other phases. This means our previous Real Convergence Time was actually 30s+.
Convergence Time is now peaking at 10s maximum, this is caused by the lazy initialization, but it's actually not impacting avg and p95 times. This makes our current Real Convergence Time a maximum of 12.5s.

CPU usage certainly increased, because now we're using mutliple goroutines to process sidecars, this is intended and actually what is expected on our use case, concurrency can be tweaked with PILOT_PUSH_THROTTLE

clarification: yellow lines are Max, purple are P95 and blue are Avg.

@hzxuzhonghu
Copy link
Copy Markdown
Member

not related: We may need another metric fot the overall time between a config update to xds push done

@sschepens
Copy link
Copy Markdown
Contributor Author

not related: We may need another metric fot the overall time between a config update to xds push done

indeed

@howardjohn
Copy link
Copy Markdown
Member

not related: We may need another metric fot the overall time between a config update to xds push done

I've tried similar, its pretty hard actually with the way our current design is.

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Nov 20, 2023
@linsun
Copy link
Copy Markdown
Member

linsun commented Nov 20, 2023

not stale

@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Nov 20, 2023
@wulianglongrd
Copy link
Copy Markdown
Member

@sschepens Has this modification been applied to the production environment in your company?

@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR needs to be rebased before being merged size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 3, 2025
@sschepens
Copy link
Copy Markdown
Contributor Author

rebased this, made it enabled by default and made concurrent conversion only work when lazy initialization is disabled, let me know if this is ok or if I should change the default behaviour

Copy link
Copy Markdown
Contributor

@keithmattix keithmattix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM but holding off on approving since I wasn't involved in the original context

@ramaraochavali
Copy link
Copy Markdown
Contributor

rebased this, made it enabled by default and made concurrent conversion only work when lazy initialization is disabled, let me know if this is ok or if I should change the default behaviour

Let us flip the behaviour as follows

  • If concurrency is specified - let us honour it and switch to concurrentScope for this release even if ENABLE_LAZY_SIDECAR_EVALUATION is defaulted to true?

Add a release note stating that concurrency will be deprecated and will be removed in future so that people can validate this new behaviour by removing concurrency flag?

@wulianglongrd Any concerns with the above since you implemented the concurrency and might be using it in prod already?

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 5, 2025
@sschepens
Copy link
Copy Markdown
Contributor Author

Let us flip the behaviour as follows

* If concurrency is specified - let us honour it and switch to concurrentScope for this release even if ENABLE_LAZY_SIDECAR_EVALUATION is defaulted to true?

Add a release note stating that concurrency will be deprecated and will be removed in future so that people can validate this new behaviour by removing concurrency flag?

@wulianglongrd Any concerns with the above since you implemented the concurrency and might be using it in prod already?

Done, also fixed tests and removed function DefaultSidecarScopeForNamespace which was only used in tests.

@ramaraochavali
Copy link
Copy Markdown
Contributor

Thank you. Will take a look tomorrow

Copy link
Copy Markdown
Contributor

@ramaraochavali ramaraochavali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This is awesome improvement.

@wulianglongrd
Copy link
Copy Markdown
Member

Thanks, I'm happy with the good optimization.

@wulianglongrd Any concerns with the above since you implemented the concurrency and might be using it in prod already?

Yes, we've been using concurrency since January 2024 until now (built it ourselves), and if I recall correctly, the initSidecarScope latency has dropped by 80%, a very large cluster with nearly 70k pods. As long as it supports smooth transition, then I think it is ok.

We have done performance testing with this and with #41453 - We found this to give better performance and better convergence times.

Out of curiosity, how much of a performance difference is there between concurrency and this implementation? It would be nice to have comparative data.

@sschepens
Copy link
Copy Markdown
Contributor Author

Out of curiosity, how much of a performance difference is there between concurrency and this implementation? It would be nice to have comparative data.

It really depends on the use case, this feature allows for Istio to never perform sidecar scope conversion of unused sidecar scopes which can really save a lot of time depending on the use case. (If you have a large mesh then it's highly likely that and istiod instance doesn't have a sidecar connected for every sidecar scope and also there are times when sidecar scopes are not actually updated to the proxies).

In the case where all sidecar scopes are used, then this feature should provide roughly the same benefit as the previous concurrency, but using the "standard" PILOT_PUSH_THROTTLE concurrency, so if your PILOT_CONVERT_SIDECAR_SCOPE_CONCURRENCY configration is less than PILOT_PUSH_THROTTLE then this could end up reducing latency.

But maybe @ramaraochavali can provide more info given that they've recently tested both features.

@sschepens
Copy link
Copy Markdown
Contributor Author

@ramaraochavali can you remove the hold?

@istio-testing
Copy link
Copy Markdown
Collaborator

@sschepens: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
integ-ambient-mc_istio 74993e8 link false /test integ-ambient-mc
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@keithmattix keithmattix removed the do-not-merge/hold Block automatic merging of a PR. label Jun 6, 2025
@ramaraochavali
Copy link
Copy Markdown
Contributor

We have seen significant reduction in proxy convergence time with this change when there are many sidecars and there is a churn in the cluster. Concurrency + Increased resource has improved but not as significant as this change. I will share more details tomorrow

@istio-testing istio-testing merged commit 768a907 into istio:master Jun 6, 2025
29 of 30 checks passed
@sschepens sschepens deleted the lazy-sidecars branch June 6, 2025 20:12
fjglira pushed a commit to fjglira/istio that referenced this pull request Sep 26, 2025
* upstream/master: (28 commits)
  Automator: update common-files@master in istio/istio@master (istio#56545)
  Automator: update proxy@master in istio/istio@master (istio#56544)
  Automator: update go-control-plane in istio/istio@master (istio#56543)
  Automator: update proxy@master in istio/istio@master (istio#56540)
  Automator: update ztunnel@master in istio/istio@master (istio#56532)
  Ambient: In ambient index, filter configs by revision (istio#56477)
  Automator: update istio/client-go@master dependency in istio/istio@master (istio#56539)
  Automator: update proxy@master in istio/istio@master (istio#56538)
  Automator: update common-files@master in istio/istio@master (istio#56537)
  optimization: allow for lazy sidecar initialization (istio#47221)
  static collection eager indexes (istio#56530)
  fix typo in flag (istio#56534)
  feat: enable support for proxy protocol on status port (istio#55986)
  remove finding of pods by IP (istio#56502)
  Automator: update proxy@master in istio/istio@master (istio#56528)
  migrate file monitor to krt (istio#55970)
  Automator: update istio/client-go@master dependency in istio/istio@master (istio#56525)
  Automator: update ztunnel@master in istio/istio@master (istio#56518)
  Fix crash in merging http routes (istio#56499)
  krt: add assertions (istio#56510)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/perf and scalability lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants