Fix race condition in paused-replicas annotation #7233
JorTurFer merged 2 commits into kedacore:main
Thanks for the PR. Yesterday and today I have spent some time investigating this issue. I have run the E2E test suite ([…]). I think the proposed code has a critical bug. About the original race condition: while the scenario you described is theoretically possible, I cannot reproduce it. I think the race window would be extremely narrow (both reconciles need to enter the stop/delete block). Can you share the exact steps to reproduce the race condition you encountered?

Thank you for your response. I ran the integration test before submitting the PR and just reran it in our staging environment. Please let me know if I missed any configuration. I shared the log in #7231. I can run it again and add more detailed steps later today.

Please run pause_scaledobject_explicitly (you may have to run it a few times, but you will see that it fails).

I even made a very quick analysis and fix. I hope this might solve your problem.
Force-pushed from c5b1248 to a1010d3

@rickbrouwer thanks for helping. I tested your change (https://github.com/rickbrouwer/keda/tree/pull-7233) and it works.

@nusmql Great! Can you fix the DCO? Then I will run the e2e tests. Other than that, your adjustment looks good :)

@rickbrouwer thank you. I think setting the pause status is the better approach: block first, then update the value. 👍
Force-pushed from bfb4e54 to 19a971d
When a ScaledObject has the paused annotation set before the HPA is created, the controller would fall through and create the HPA, ignoring the pause annotation. The fix writes the paused status to etcd immediately before stopping the scale loop or deleting the HPA. This prevents race conditions where concurrent reconciles triggered by HPA deletion would not see the paused status and perform redundant operations. The key insight is to establish the paused state in etcd BEFORE any operations that trigger new reconciles, ensuring subsequent reconciles see the paused status and exit early. This solution follows the approach suggested by @rickbrouwer. Fixes kedacore#7231 Signed-off-by: nusmql <nusmql@gmail.com>
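The ordering described in this commit message can be illustrated with a small simulation. This is only a sketch under stated assumptions: `fakeCluster`, `reconcilePaused`, and `simulate` are hypothetical names invented for illustration, the in-memory struct stands in for the API server and etcd, and none of it is taken from the KEDA code base.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeCluster is a hypothetical in-memory stand-in for the persisted state
// (ScaledObject status and HPA) that the real controller reads via the API server.
type fakeCluster struct {
	mu        sync.Mutex
	paused    bool // persisted Paused status
	hpaExists bool
	deletes   int // number of HPA deletions performed
}

// reconcilePaused handles a ScaledObject that carries the paused annotation.
// It reports whether this call performed the stop/delete work.
// onDelete simulates the reconcile that an HPA deletion event would trigger.
func (c *fakeCluster) reconcilePaused(onDelete func()) bool {
	c.mu.Lock()
	if c.paused {
		c.mu.Unlock()
		return false // paused status already persisted: exit early
	}
	// Step 1: persist the paused status BEFORE any operation that
	// can trigger another reconcile.
	c.paused = true
	c.mu.Unlock()

	// Step 2: only now delete the HPA.
	c.mu.Lock()
	c.hpaExists = false
	c.deletes++
	c.mu.Unlock()

	if onDelete != nil {
		onDelete() // the deletion event triggers a follow-up reconcile
	}
	return true
}

// simulate runs one reconcile whose HPA deletion triggers a second reconcile.
func simulate() (first, second bool, deletes int) {
	c := &fakeCluster{hpaExists: true}
	first = c.reconcilePaused(func() {
		second = c.reconcilePaused(nil)
	})
	return first, second, c.deletes
}

func main() {
	first, second, deletes := simulate()
	fmt.Println(first, second, deletes) // prints "true false 1"
}
```

Because the paused flag is stored before the deletion, the follow-up reconcile returns early and the HPA is deleted only once. If the two steps were swapped, the second call could still observe `paused == false` and repeat the work.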
Force-pushed from 19a971d to aa91a59

/run-e2e internals

/run-e2e internals

/run-e2e internals

started 3 tests, just to be sure 🙂

Hi @nusmql, your branch has conflicts that must be resolved. Could you look into that?
Signed-off-by: nusmql <nusmql@gmail.com>
✅ Snyk checks have passed. No issues have been found so far.

/run-e2e internals

Thanks a lot!
* fix: Correct parse error ActiveMQ (#7245)
* fix: metricUnavailableValue parameter not working in Datadog scaler (#7241)
  * fix: metricUnavailableValue parameter not working in Datadog scaler
    The UseFiller flag was not being set correctly when metricUnavailableValue was configured. This fix distinguishes between 'not configured' and 'explicitly set to 0' by checking TriggerMetadata directly.
    Changes:
    - Set UseFiller in validateAPIMetadata() when metricUnavailableValue exists
    - Set UseFiller in validateClusterAgentMetadata() when metricUnavailableValue exists
    - Remove UseFiller logic from Validate() (responsibility moved to validate functions)
    - Update tests to verify UseFiller behavior with various values including 0
    This allows users to explicitly set metricUnavailableValue to 0 and have it work as a fallback value, while still erroring when not configured. Fixes #7238
  * test: cover both API and ClusterAgent modes in UseFiller test
    Updated TestDatadogMetadataValidateUseFiller to test both validateAPIMetadata() and validateClusterAgentMetadata() code paths. This ensures that the UseFiller flag is correctly set in both integration modes.
    Test cases now cover:
    - API mode: 5 test cases (not configured, 0, positive, negative, decimal)
    - Cluster Agent mode: 5 test cases (same variations)
  * refactor: use pointer type for FillValue to avoid TriggerMetadata access
    Changed FillValue from float64 to *float64 to distinguish between 'not configured' (nil) and 'explicitly set to any value including 0'. This addresses reviewer feedback about avoiding direct TriggerMetadata access and improves type safety and refactoring resistance.
    Changes:
    - FillValue type changed from float64 to *float64 with optional tag
    - validateAPIMetadata checks nil instead of TriggerMetadata map
    - validateClusterAgentMetadata checks nil instead of TriggerMetadata map
    - Dereference FillValue when returning fallback value (2 locations)
    - Update tests to handle pointer type with proper nil checks
* Fix ScaledObject pause behavior when HPA doesn't exist (#7233)
  Fixes #7231
* fix: use TriggerError when all ScaledJob triggers fail (#7205)
* Fix transfer-hpa-ownership panic when hpa name not provided (#7260)
  * chore: renormalize line endings
  * fix: nil pointer when transfer-hpa-ownership is true but hpa name not specified (#7254)
  * update changelog
  * revert vendor changes
* fix: restore HPA behavior when paused-scale-in/out annotation is deleted (#7291)
  When paused-scale-in or paused-scale-out annotation is deleted (not set to "false") and the corresponding selectPolicy (scaleDown.selectPolicy or scaleUp.selectPolicy) is not explicitly set in the ScaledObject spec, the HPA's SelectPolicy remains stuck at "Disabled" instead of being restored. This occurs even if other behavior fields like policies or stabilizationWindowSeconds are defined; only an explicit selectPolicy value triggers the update.
  Root cause: DeepDerivative treats nil as "unset" and considers it a subset of any value, so DeepDerivative(nil, Disabled) returns true, preventing the HPA update.
  Fix: Add explicit DeepEqual check for Behavior field, following the existing pattern used for Metrics length check.
  test: add e2e test for paused-scale-in annotation removal
* refactor: remove unused scaledObjectMetricSpecs variable (#7292)
  * refactor: remove unused scaledObjectMetricSpecs variable
  * update CHANGELOG.md
* fix: handle requestScaleLoop error in ScaledObject controller (#7273)
  * fix: handle requestScaleLoop error in ScaledObject controller
  * chore: update CHANGELOG for PR #7273
* bump actions and go version (#7295)
  * bump actions and go version
  * bump deps
  * update pkgs
  * update tools
  * .
  * fix test
  * fix lint
  * update setup-go to use go.mod version
  * add nolint to exclude pulsar issues
  * fix devenv
  * fix codeql
  * fix splunk test
  * include job in links
  * update to ubuntu-slim some runners
  * Update apis/keda/v1alpha1/scaledobject_webhook_test.go
  * Update .github/workflows/scorecards.yml
* update changelog

Signed-off-by: Rick Brouwer <rickbrouwer@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>
Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>
Signed-off-by: nusmql <nusmql@gmail.com>
Signed-off-by: James Williams <jamesleighwilliams@gmail.com>
Signed-off-by: Dima Shevchuk <dshedimon@gmail.com>
Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
Co-authored-by: Rick Brouwer <rickbrouwer@gmail.com>
Co-authored-by: Matchan <fenethtool@gmail.com>
Co-authored-by: nusmql <nusmql@gmail.com>
Co-authored-by: James Williams <jamesleighwilliams@gmail.com>
Co-authored-by: Dima Shevchuk <dshedimon@gmail.com>
Co-authored-by: Kai Udo <76635578+u-kai@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jan Wozniak <wozniak.jan@gmail.com>
Description
Fixes a race condition in the paused-replicas annotation handling that could cause ScaledObjects to get stuck in an inconsistent state.
Fixes #7231
Problem
When applying the autoscaling.keda.sh/paused-replicas annotation to a ScaledObject, a race condition could occur that leaves the system permanently inconsistent. This happens intermittently and is timing-dependent.
Root Cause
The issue occurs when:
[…] Paused=True in memory
[…] Paused=True to Kubernetes API is slow (50-100ms+)
[…] Paused=False status
In the buggy code, both reconciles would enter the stop/delete block because of the dangerous scaledToPausedCount := true default.
Solution
Add HPA existence check before attempting stop/delete operations. This uses HPA existence as a state indicator:
The key insight: the new HPA created in the second reconcile is what actually scales the deployment to the paused replica count.
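As a rough illustration of using HPA existence as a guard, the check could be reduced to a predicate like the one below. This is a minimal sketch, not the controller's code: `shouldStopAndDelete` and its parameters are hypothetical, and the real controller would obtain both inputs from the Kubernetes API.

```go
package main

import "fmt"

// shouldStopAndDelete is a hypothetical helper (not a KEDA function).
// It allows the stop/delete block only when the paused annotation is set
// AND an HPA still exists, so a reconcile that runs after the HPA has
// already been deleted skips that block.
func shouldStopAndDelete(pausedAnnotationSet, hpaExists bool) bool {
	return pausedAnnotationSet && hpaExists
}

func main() {
	// First reconcile: the HPA is still there, so stop/delete runs.
	fmt.Println(shouldStopAndDelete(true, true)) // prints "true"
	// Later reconcile: the HPA is already gone, so the block is skipped.
	fmt.Println(shouldStopAndDelete(true, false)) // prints "false"
}
```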
Changes
Added an HPA existence check in reconcileScaledObject() before stop/delete operations