feat: make event subscriber buffer size configurable#2282
Conversation
…rkflow job summary. On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
remove branch input. remove non-informative tags and versions from wo…
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
…e event buffer
When a worker pool resolution event is dropped (non-blocking channel
send, buffer full), controllers stuck permanently in ResolutionInProgress
because return ctrl.Result{}, nil has no fallback requeue. This became
fatal after removing the interval field from Resource CRs (PR open-component-model#2116).
Changes:
- Add RequeueAfter: 30s on all ResolutionInProgress return paths in
resource, component, and deployer controllers as safety net
- Increase subscriber channel buffer from 10 to 100 to reduce drop
probability
- Deployer now propagates ErrResolutionInProgress to caller for proper
RequeueAfter handling
On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com>
Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
✅ Deploy Preview for ocm-website ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a configurable per-subscriber buffer size for resolver event channels: a CLI flag Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Remove RequeueAfter safety-net changes from controllers. Keep only the event channel buffer enhancement, but make it configurable through the new --resolver-subscriber-buffer-size flag (default: 10) instead of hardcoding it. This allows tuning for testing (e.g. buffer=1 to force drops) or production (e.g. buffer=100 to reduce drops). On-behalf-of: Gerald Morrison (SAP) <gerald.morrison@sap.com> Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
kubernetes/controller/cmd/main.go (1)
113-114: Default preserves prior drop behavior — document the tuning signal.Default
10matches the previously hardcoded buffer, so out-of-the-box behavior (and the original conformance flakiness scenario) is unchanged unless operators explicitly raise this flag. Since the earlierRequeueAftersafety-net was dropped, dropped events no longer have a timer-based recovery path; the only signal operators have isEventChannelDropsTotal.Consider calling out the
resolver_event_channel_drops_totalmetric in the flag help text so operators know when/why to increase the value.- flag.IntVar(&resolverSubscriberBuffer, "resolver-subscriber-buffer-size", 10, //nolint:mnd // no magic number - "The buffer size for each subscriber's event channel. A larger buffer reduces the probability of dropped resolution events under load.") + flag.IntVar(&resolverSubscriberBuffer, "resolver-subscriber-buffer-size", 10, //nolint:mnd // no magic number + "The buffer size for each subscriber's event channel. A larger buffer reduces the probability of dropped resolution events under load. "+ + "Tune upward if the resolver_event_channel_drops_total metric is non-zero.")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kubernetes/controller/cmd/main.go` around lines 113 - 114, The flag help for resolverSubscriberBuffer (set via flag.IntVar with name "resolver-subscriber-buffer-size") should be updated to mention the tuning signal metric so operators know when to increase it: modify the flag's help string to reference the Prometheus metric resolver_event_channel_drops_total (or resolver_event_channel_drops_total) and explain that if that metric increases under load they should raise the buffer size to reduce dropped resolution events; keep the default value but document that raising the flag mitigates drops now that the RequeueAfter safety-net was removed.kubernetes/controller/internal/resolution/workerpool/workerpool.go (1)
98-100: LGTM — configurable buffer is implemented correctly.
PoolOptions.SubscriberBufferSize, the<= 0defaulting to10, and the use ofwp.SubscriberBufferSizeinSubscribe()are consistent and preserve backward-compatible behavior.One residual concern worth flagging (not a blocker): the non-blocking broadcast in
handleWorkItem(Lines 369–387) silently drops events when a channel is full. All three controllers (resource, deployer, component) return immediately onErrResolutionInProgresswith noRequeueAfterfallback—they rely purely on event-driven re-triggering. Under sustained bursts, if a drop happens, this leaves a path to a permanently-stalledReady=False / ResolutionInProgressobject with no timeout recovery. Consider either:
- keeping the 30s
RequeueAftersafety net on theResolutionInProgressreturn paths as defense-in-depth, or- making the broadcast blocking with a short per-subscriber timeout (trading worker latency for guaranteed delivery).
Happy to leave this for a follow-up if the team prefers to observe
EventChannelDropsTotalbehavior in production first.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kubernetes/controller/internal/resolution/workerpool/workerpool.go` around lines 98 - 100, The non-blocking broadcast in handleWorkItem can silently drop events when subscriber channels (created via Subscribe using wp.SubscriberBufferSize / PoolOptions.SubscriberBufferSize) are full, risking permanently stalled ResolutionInProgress objects; update handleWorkItem to either (a) restore a 30s RequeueAfter on controllers that currently return ErrResolutionInProgress so the controller will retry as a safety net, or (b) change the broadcast to be blocking with a short per-subscriber timeout (e.g., loop with select on send and a time.After) so delivery is retried briefly before counting a drop and incrementing EventChannelDropsTotal; pick one approach and implement consistent behavior across the resource/deployer/component controllers and ensure EventChannelDropsTotal is incremented when a send truly fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kubernetes/controller/internal/controller/deployer/deployer_controller.go`:
- Line 53: The import for the configuration package in deployer_controller.go is
pointing to a non-existent package
"ocm.software/open-component-model/kubernetes/controller/pkg/configuration"
causing references to Configuration and LoadConfigurations to break; revert the
import to the existing internal/configuration package (the same change should be
applied to resource_controller.go and component_controller.go) so that the code
uses the existing Configuration type and LoadConfigurations function defined in
internal/configuration/config.go.
---
Nitpick comments:
In `@kubernetes/controller/cmd/main.go`:
- Around line 113-114: The flag help for resolverSubscriberBuffer (set via
flag.IntVar with name "resolver-subscriber-buffer-size") should be updated to
mention the tuning signal metric so operators know when to increase it: modify
the flag's help string to reference the Prometheus metric
resolver_event_channel_drops_total (or resolver_event_channel_drops_total) and
explain that if that metric increases under load they should raise the buffer
size to reduce dropped resolution events; keep the default value but document
that raising the flag mitigates drops now that the RequeueAfter safety-net was
removed.
In `@kubernetes/controller/internal/resolution/workerpool/workerpool.go`:
- Around line 98-100: The non-blocking broadcast in handleWorkItem can silently
drop events when subscriber channels (created via Subscribe using
wp.SubscriberBufferSize / PoolOptions.SubscriberBufferSize) are full, risking
permanently stalled ResolutionInProgress objects; update handleWorkItem to
either (a) restore a 30s RequeueAfter on controllers that currently return
ErrResolutionInProgress so the controller will retry as a safety net, or (b)
change the broadcast to be blocking with a short per-subscriber timeout (e.g.,
loop with select on send and a time.After) so delivery is retried briefly before
counting a drop and incrementing EventChannelDropsTotal; pick one approach and
implement consistent behavior across the resource/deployer/component controllers
and ensure EventChannelDropsTotal is incremented when a send truly fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: de82ea8b-7e2e-45a0-aeb9-84857e702179
📒 Files selected for processing (5)
kubernetes/controller/cmd/main.gokubernetes/controller/internal/controller/component/component_controller.gokubernetes/controller/internal/controller/deployer/deployer_controller.gokubernetes/controller/internal/controller/resource/resource_controller.gokubernetes/controller/internal/resolution/workerpool/workerpool.go
…kiness-requeue-safety-net Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
…e-safety-net' into fix/conformance-flakiness-requeue-safety-net Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
a85af9c to
ac771b3
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
kubernetes/controller/cmd/main.go (1)
90-90: Minor: variable name drops theSizesuffix.
resolverSubscriberBufferis slightly inconsistent with the flag nameresolver-subscriber-buffer-sizeand thePoolOptions.SubscriberBufferSizefield it maps to. Renaming toresolverSubscriberBufferSizewould make the chain flag → var → option field uniform. Purely cosmetic.Also applies to: 113-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kubernetes/controller/cmd/main.go` at line 90, Rename the local variable resolverSubscriberBuffer to resolverSubscriberBufferSize so it matches the flag name resolver-subscriber-buffer-size and the PoolOptions.SubscriberBufferSize field; update all usages where the flag is parsed and passed into PoolOptions (the variable declaration and the places where it's assigned or referenced, e.g., the flag definition and when constructing PoolOptions) to use the new resolverSubscriberBufferSize identifier for consistent naming.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@kubernetes/controller/cmd/main.go`:
- Line 90: Rename the local variable resolverSubscriberBuffer to
resolverSubscriberBufferSize so it matches the flag name
resolver-subscriber-buffer-size and the PoolOptions.SubscriberBufferSize field;
update all usages where the flag is parsed and passed into PoolOptions (the
variable declaration and the places where it's assigned or referenced, e.g., the
flag definition and when constructing PoolOptions) to use the new
resolverSubscriberBufferSize identifier for consistent naming.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b346f4ca-bc5e-4626-b26b-d23a530046b9
📒 Files selected for processing (10)
kubernetes/controller/chart/README.mdkubernetes/controller/chart/templates/crd/components.delivery.ocm.software.yamlkubernetes/controller/chart/templates/crd/deployers.delivery.ocm.software.yamlkubernetes/controller/chart/templates/crd/repositories.delivery.ocm.software.yamlkubernetes/controller/chart/templates/crd/resources.delivery.ocm.software.yamlkubernetes/controller/chart/templates/manager/manager.yamlkubernetes/controller/chart/values.schema.jsonkubernetes/controller/chart/values.yamlkubernetes/controller/cmd/main.gokubernetes/controller/internal/resolution/workerpool/workerpool.go
💤 Files with no reviewable changes (4)
- kubernetes/controller/chart/templates/crd/repositories.delivery.ocm.software.yaml
- kubernetes/controller/chart/templates/crd/resources.delivery.ocm.software.yaml
- kubernetes/controller/chart/templates/crd/components.delivery.ocm.software.yaml
- kubernetes/controller/chart/templates/crd/deployers.delivery.ocm.software.yaml
✅ Files skipped from review due to trivial changes (2)
- kubernetes/controller/chart/values.schema.json
- kubernetes/controller/chart/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- kubernetes/controller/internal/resolution/workerpool/workerpool.go
frewilhelm
left a comment
There was a problem hiding this comment.
Subscriber channel buffer in WorkerPool.Subscribe() is hardcoded to 10. Under CI resource constraints, this can fill up and silently drop resolution events. No way to tune without code changes.
I am missing a change that adjusts the channel buffer? Am I missing something? :D
…kiness-requeue-safety-net Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
0606bcf accidentally clobbered the webhook conversion blocks added by open-component-model#2274 when adopting helm chart changes for the subscriber buffer flag. Restore all 4 CRD templates to upstream/main state. Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
Subscribe() was still hardcoding 10 despite SubscriberBufferSize being added to PoolOptions. Wire it up so the CLI flag actually takes effect. Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kubernetes/controller/cmd/main.go`:
- Around line 249-256: The subscriber channels are still created with a
hardcoded buffer of 10, so the PoolOptions.SubscriberBufferSize passed into
NewWorkerPool is ignored; locate the channel creation in the workerpool
implementation where it does make(chan []RequesterInfo, 10) (the subscriber
creation code inside NewWorkerPool / the worker pool struct methods) and replace
the literal 10 with the configured buffer value from the pool options (e.g.,
PoolOptions.SubscriberBufferSize or the workerPool instance's
opts.SubscriberBufferSize) so the --resolver-subscriber-buffer-size flag
actually controls subscriber channel capacity.
- Around line 134-138: The current check compares resolverSubscriberBuffer to
resolverWorkerQueueLength before handling non-positive subscriber sizes
(workerpool.NewWorkerPool treats <=0 as default 10), so add an explicit
validation for resolverSubscriberBuffer <= 0 first: if resolverSubscriberBuffer
<= 0, log an error via setupLog.Error with a clear message referencing
"resolver-subscriber-buffer-size" and exit (or alternatively normalize it to the
workerpool default of 10 before further checks); then perform the existing check
comparing resolverSubscriberBuffer and resolverWorkerQueueLength. Reference
resolverSubscriberBuffer, resolverWorkerQueueLength, and
workerpool.NewWorkerPool() in the change so the intent and behavior are clear.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a63ce66a-5b7a-444a-a6f5-9eb510c381a7
📒 Files selected for processing (1)
kubernetes/controller/cmd/main.go
…ion comment - Add TestWorkerPool_SubscriberBufferSize: verifies custom, zero, and negative values produce expected channel capacity. - Log SubscriberBufferSize at worker pool startup for operational visibility. - Add inline comment explaining why subscriber buffer must not exceed queue length. Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
no, you don't miss anything, I did 😂. Corrected now. I also added the new param to the log and commented it. |
…at startup Passing <= 0 bypassed the upper bound check because NewWorkerPool silently defaults it to 10, which could exceed the queue length. Fail fast instead of silently applying a different value. Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
Co-authored-by: Frederic Wilhelm <fre.wilhelm@gmail.com> Signed-off-by: Gerald Morrison <67469729+morri-son@users.noreply.github.com>
…kiness-requeue-safety-net Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
….com:morri-son/open-component-model into fix/conformance-flakiness-requeue-safety-net Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
frewilhelm
left a comment
There was a problem hiding this comment.
I think that fine and does make the subscriber buffer size configurable. However, do we want to adjust the default since we think they are the root-cause for the flakes?
<!-- markdownlint-disable MD041 --> #### What this PR does / why we need it Follow up of #2282, now using the possibility to configure the default and set it to 100. Signed-off-by: Gerald Morrison (SAP) <gerald.morrison@sap.com> Co-authored-by: Gerald Morrison (SAP) <gerald.morrison@sap.com>
Problem
Subscriber channel buffer in
WorkerPool.Subscribe()is hardcoded to 10. Under CI resource constraints, this can fill up and silently drop resolution events. No way to tune without code changes.Fix
SubscriberBufferSizefield inPoolOptions(default: 10)--resolver-subscriber-buffer-sizeto configure itSubscribe()uses the configured value instead of a hardcoded constantThis enables tuning for testing (e.g.
=1to force drops) or production (e.g.=100to reduce drop probability).