Skip to content

Promote realtime composition to beta - don't poll in the XR reconciler#6407

Merged
negz merged 12 commits intocrossplane:mainfrom
negz:pollution
Apr 28, 2025
Merged

Promote realtime composition to beta - don't poll in the XR reconciler#6407
negz merged 12 commits intocrossplane:mainfrom
negz:pollution

Conversation

@negz
Copy link
Copy Markdown
Member

@negz negz commented Apr 25, 2025

Description of your changes

Fixes #4828
Fixes #4316
Fixes #6408

See also recent realtime composition bug fixes in #6395.

After this commit realtime composition is enabled by default. You can disable it by setting --enable-realtime-composition=false.

When realtime composition is enabled XRs won't requeue on a poll interval by default.

XRs will requeue a reconcile when one of the following happens:

  • The XR changes
  • A composed resource changes
  • A new CompositionRevision is created, and the XR should auto-update
  • Crossplane's sync interval (default 1 hour) is exceeded
  • The last function pipeline run's TTL is exceeded

We consider a function pipeline run's TTL to be the shortest non-zero TTL of any RunFunctionResult returned by the pipeline. If the pipeline's TTL is non-zero we requeue after that TTL.

This lets functions control when Crossplane will call them again.

If a function needs to be called periodically it should set a TTL. A function might need to do this if it needs to periodically refresh its extra resources (Crossplane doesn't watch them) or read an input from an external system.

Crossplane also reserves the right to use the TTL for caching. This isn't implemented, but when it is Crossplane may not call the function before the TTL.

Today many functions return a TTL of 1 minute. This is because function-sdk-go and function-sdk-python use 1 minute as the default TTL (see response.go and response.py).

I think we should update the SDKs to use a longer default TTL, since the current default will effectively be equivalent to polling every minute.

I don't think this is really a breaking behavior change, but it is a significant one. I've added the scary label so we remember to include a prominent release note about the change.

This PR also adds a few basic metrics for the controller engine:

# HELP composition_controllers_started_total Total number of XR controllers started.
# TYPE composition_controllers_started_total counter
composition_controllers_started_total{controller="claim/xnopresources.nop.example.org"} 1
composition_controllers_started_total{controller="composite/xnopresources.nop.example.org"} 1
# HELP composition_watches_started_total Total number of watches started.
# TYPE composition_watches_started_total counter
composition_watches_started_total{controller="claim/xnopresources.nop.example.org",type="Claim"} 1
composition_watches_started_total{controller="claim/xnopresources.nop.example.org",type="CompositeResource"} 1
composition_watches_started_total{controller="composite/xnopresources.nop.example.org",type="ComposedResource"} 2
composition_watches_started_total{controller="composite/xnopresources.nop.example.org",type="CompositeResource"} 1
composition_watches_started_total{controller="composite/xnopresources.nop.example.org",type="CompositionRevision"} 1

I have:

Need help with this checklist? See the cheat sheet.

@negz negz force-pushed the pollution branch 3 times, most recently from 1c29af3 to 8a777c6 Compare April 26, 2025 07:49
@negz negz marked this pull request as ready for review April 26, 2025 07:54
@negz negz requested a review from a team as a code owner April 26, 2025 07:54
@negz negz requested review from bobh66, jbw976 and turkenh April 26, 2025 07:54
@negz negz mentioned this pull request Apr 26, 2025
7 tasks
@negz negz force-pushed the pollution branch 2 times, most recently from 4c0c456 to b2de309 Compare April 27, 2025 05:21
@negz negz added this to the v1.20 milestone Apr 27, 2025
Copy link
Copy Markdown
Member

@turkenh turkenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me! Please see my comment around around possibility to turn it off and get back to the current state if needed.

func WithPollInterval(interval time.Duration) ReconcilerOption {
return WithPollIntervalHook(func(_ context.Context, _ *composite.Unstructured) time.Duration {
// Jitter the poll interval +/- 10%.
return interval + time.Duration((rand.Float64()-0.5)*2*(float64(interval)*0.1)) //nolint:gosec // No need for secure randomness
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this still be needed if someone needs/decides to disable the beta feature for whatever reason? I am particularly concerned about resource consumption/scalability of real time compositions and wondering if there is a way to get back to the previous configuration if something goes wrong.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition reconciler still sets a poll interval when realtime compositions is disabled. The jittering happens inline - I wanted to remote pollIntervalHook as a simplification.

o = append(o, composite.WithWatchStarter(composite.ControllerName(d.GetName()), h, r.engine))
o = append(o,
composite.WithWatchStarter(composite.ControllerName(d.GetName()), h, r.engine),
composite.WithPollInterval(0), // Disable polling.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turkenh You can see on line 551 of this file that it'll default to composite.WithPollInterval(r.options.PollInterval).

We only disable polling here inside the if r.options.Features.Enabled(features.EnableBetaRealtimeCompositions) branch.

negz added 12 commits April 28, 2025 12:09
Nothing has read this field for a long time - claims watch their XRs.

Signed-off-by: Nic Cope <nicc@rk0n.org>
After this commit realtime composition is enabled by default. You can
disable it by setting --enable-realtime-composition=false.

When realtime composition is enabled XRs won't requeue on a poll
interval by default.

XRs will requeue a reconcile when one of the following happens:

* The XR changes
* A composed resource changes
* A new CompositionRevision is created, and the XR should auto-update
* Crossplane's sync interval (default 1 hour) is exceeded
* The last function pipeline run's TTL is exceeded

We consider a function pipeline run's TTL to be the shortest non-zero
TTL of any RunFunctionResult returned by the pipeline. If the pipeline's
TTL is non-zero we requeue after that TTL.

This lets functions control when Crossplane will call them again.

If a function needs to be called periodically it should set a TTL. A
function might need to do this if it needs to periodically refresh its
extra resources (Crossplane doesn't watch them) or read an input from an
external system.

Crossplane also reserves the right to use the TTL for caching. This
isn't implemented, but when it is Crossplane may not call the function
_before_ the TTL.

Today many functions return a TTL of 1 minute. This is because
function-sdk-go and function-sdk-python use 1 minute as the default TTL.
I think we should update the SDKs to use a longer default TTL, since the
current default will effectively be equivalent to polling every minute.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Instead use regular composition tests to test for realtime behavior.

Signed-off-by: Nic Cope <nicc@rk0n.org>
We don't want the watch garbage collector to stop a watch for a composed
resource prematurely. If it did, the XR controller wouldn't be called to
reconcile the XR when that type of composed resource changed.

Before this commit the watch garbage collector determined what watches
were still needed only by listing from cache. If the cache was stale, it
might've incorrectly determined a watch was no longer needed and stopped
it.

After this commit, the watch garbage collector double checks by making
direct calls to the API server to determine what watches are still
needed before it stops any.

The race still exists, but it's much smaller now. A watch would have to
become needed between when Crossplane lists from the API server and when
it calls StopWatches.

Signed-off-by: Nic Cope <nicc@rk0n.org>
This'll let folks monitor things like:

* How many XR controllers are running
* How many watches are running
* Whether an XR controller is flapping
* Whether an XR controller is watching its XR and CompRev
* How many types of composed resource an XR controller is watching

Signed-off-by: Nic Cope <nicc@rk0n.org>
It's functionally identical (no pun intended) to the variant that uses
functions, but the native version has a nasty bug. When it patches the
MR it removes the MR's finalizer, which causes the XR and MR
controllers to fight. This is bad when realtime compositions are
enabled.

We could potentially fix the native P&T variant (e.g. by patching in the
finalizer) but I don't think it's worth it. Apart from this very
specific scenario where we artificially set a finalizer to block the
delete, the composition implementation shouldn't matter. Given native
P&T is deprecated let's drop it and save some time in E2E runs.

Signed-off-by: Nic Cope <nicc@rk0n.org>
It's a beta feature now - on by default.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Detect whether a revision is compatible earlier and more often, and
return early.

Signed-off-by: Nic Cope <nicc@rk0n.org>
30 seconds seems mostly okay, but sometimes it's taking ~45 in CI.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Just a drive-by nit. The convention in this struct is for XR scoped
stuff to be top-level, e.g. ConnectionDetails, events, etc.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Fixing some places where I missed saying that in the feature flag.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 28, 2025

This now appears to be failing E2E 100% of the time due to #6413. Looking into it.

Edit: For some reason this is failing just as much on main right now. Still looking into it, but I doubt it's related to this PR so I'm going to merge it.

@negz negz merged commit 09e27bd into crossplane:main Apr 28, 2025
36 of 38 checks passed
@negz negz deleted the pollution branch April 28, 2025 21:50
@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 28, 2025

Added this to the draft release notes for v1.20


Crossplane no longer reconciles XRs on a regular poll interval. The --poll-interval flag now has no effect on XRs. Instead Crossplane watches all resources and only reconciles XRs when something changes.

  • Crossplane also reconciles an XR when a composition function response's TTL expires.
  • ⚠️ To preserve the previous behavior when upgrading to v1.20.0, set --set args='{"--enable-realtime-composition=false"}' during the Helm upgrade.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants