
Fix race condition in paused-replicas annotation#7233

Merged
JorTurFer merged 2 commits into kedacore:main from nusmql:fix-7231-pause-replicas-race-condition
Nov 19, 2025


@nusmql
Contributor

@nusmql nusmql commented Nov 4, 2025

Description

Fixes a race condition in the paused-replicas annotation handling that could cause ScaledObjects to get stuck in an inconsistent state.

Fixes #7231

Problem

When applying the autoscaling.keda.sh/paused-replicas annotation to a ScaledObject, a race condition could occur that leaves the system permanently inconsistent:

  • KEDA marks the object as paused at the target replica count (Paused=True)
  • The underlying Deployment remains at its previous replica count (not scaled)
  • No HPA or scale loop exists to correct the state
  • Manual intervention is required to recover

This happens intermittently and is timing-dependent.

Root Cause

The issue occurs when:

  1. Reconcile #1 deletes the HPA (which triggers Reconcile #2 via the watch) and sets Paused=True in memory
  2. The status write that persists Paused=True to the Kubernetes API is slow (50-100ms+)
  3. Reconcile #2, triggered by the HPA deletion, is fast (30-40ms) and reads the stale Paused=False status
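To make the ordering in these three steps concrete, here is a minimal Go model. It is not KEDA code: it treats the API server as a single persisted Paused flag and replays Reconcile #1's status write and Reconcile #2's read by their time offsets after the HPA deletion. The names `event` and `observedPaused` are hypothetical, and the model ignores caching.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a hypothetical record of one API interaction, timed in ms after
// Reconcile #1 deletes the HPA.
type event struct {
	atMs    int
	isWrite bool // true: Reconcile #1's status write; false: Reconcile #2's read
}

// observedPaused returns the Paused value Reconcile #2 reads when Reconcile #1's
// write lands at writeMs and Reconcile #2 reads at readMs.
func observedPaused(writeMs, readMs int) bool {
	events := []event{{atMs: writeMs, isWrite: true}, {atMs: readMs, isWrite: false}}
	// Stable sort: on a tie the write (listed first) is applied before the read.
	sort.SliceStable(events, func(i, j int) bool { return events[i].atMs < events[j].atMs })

	persisted := false // Paused=False until the write lands
	for _, e := range events {
		if e.isWrite {
			persisted = true
			continue
		}
		return persisted // the read sees whatever has been persisted so far
	}
	return persisted
}

func main() {
	// Slow write (80ms) vs fast Reconcile #2 (35ms), within the ranges above.
	fmt.Println("write 80ms, read 35ms -> Paused seen:", observedPaused(80, 35))
	// Hypothetical contrast: if the write had landed first, Paused=True is seen.
	fmt.Println("write 20ms, read 35ms -> Paused seen:", observedPaused(20, 35))
}
```

With an 80ms write and a 35ms read, the model reports the stale Paused=False. In a real controller-runtime setup, the default client reads from an informer cache, so the effective window can be wider than the raw write latency.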

In the buggy code, both reconciles would enter the stop/delete block because of the dangerous scaledToPausedCount := true default:

case needsToPause:
    scaledToPausedCount := true  // ← Dangerous default
    if conditions.GetPausedCondition().Status == metav1.ConditionTrue {
        scaledToPausedCount = r.checkIfTargetResourceReachPausedCount(...)
        if scaledToPausedCount {
            return // Already done
        }
    }
    if scaledToPausedCount {
        // Enters this block in BOTH reconciles during race
        stopScaleLoop()   // Already stopped!
        deleteHPA()       // Already deleted!
        return            // Exit without creating new HPA/loop
    }

**Result:** No HPA is created and no scale loop is started → the system is stuck indefinitely.
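The branch above can be reduced to a small, self-contained Go model to show why two reconciles that both read Paused=False leave nothing behind. This is a sketch, not KEDA's code. `buggyDecision`, `apply`, and `state` are hypothetical names, `reachedCount` stands in for the elided `checkIfTargetResourceReachPausedCount(...)` call, and stopping the loop or deleting the HPA is modeled as clearing two flags.

```go
package main

import "fmt"

// action is a hypothetical label for how the needsToPause branch ends.
type action string

const (
	alreadyPaused action = "return early (already paused)"
	stopAndDelete action = "stop scale loop, delete HPA, return"
	fallThrough   action = "fall through (create HPA and scale loop)"
)

// buggyDecision mirrors the buggy branch: pausedSeen is the Paused condition
// the reconcile read, reachedCount replaces the elided target-count check.
func buggyDecision(pausedSeen, reachedCount bool) action {
	scaledToPausedCount := true // the dangerous default
	if pausedSeen {
		scaledToPausedCount = reachedCount
		if scaledToPausedCount {
			return alreadyPaused
		}
	}
	if scaledToPausedCount {
		return stopAndDelete
	}
	return fallThrough
}

// state tracks whether an HPA and a scale loop exist for the ScaledObject.
type state struct{ hpa, scaleLoop bool }

// apply models each outcome; stopping or deleting something already gone is a
// no-op, as in the second reconcile of the race.
func apply(s state, a action) state {
	switch a {
	case stopAndDelete:
		s.hpa, s.scaleLoop = false, false
	case fallThrough:
		s.hpa, s.scaleLoop = true, true
	}
	return s
}

func main() {
	s := state{hpa: true, scaleLoop: true}
	// Both reconciles read Paused=False: #1 because pausing has just started,
	// #2 because the status write has not landed yet.
	for _, name := range []string{"Reconcile #1", "Reconcile #2"} {
		a := buggyDecision(false, false)
		s = apply(s, a)
		fmt.Printf("%s: %s -> %+v\n", name, a, s)
	}
}
```

After both steps the model holds neither an HPA nor a scale loop, which matches the stuck state listed under Problem.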

Solution

Add an HPA existence check before attempting the stop/delete operations. This uses HPA existence as a state indicator:

  • HPA exists → First reconcile entering pause mode → Stop loop and delete HPA
  • HPA doesn't exist → Race reconcile or already deleted → Fall through to create new HPA/scale loop

The key insight: the new HPA created in Reconcile #2 is what actually scales the deployment to the paused replica count.

if scaledToPausedCount {
    // Check if HPA exists before attempting to stop scale loop and delete HPA
    hpaName := getHPAName(scaledObject)
    foundHpa := &autoscalingv2.HorizontalPodAutoscaler{}
    err := r.Client.Get(ctx, types.NamespacedName{Name: hpaName, Namespace: scaledObject.Namespace}, foundHpa)

    if err == nil {
        // HPA exists - stop the scale loop and delete the HPA (Reconcile #1)
        stopScaleLoop()
        deleteHPA()
        conditions.SetPausedCondition(metav1.ConditionTrue, ...)
        return
    }
    // HPA doesn't exist - fall through to create new HPA and scale loop (Reconcile #2)
    // The new HPA will scale the deployment to the paused replica count
}
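A sketch of the proposed check in the same simplified style shows what the extra condition can and cannot distinguish. The names are hypothetical, and HPA existence is passed in as a boolean rather than fetched with `r.Client.Get`.

```go
package main

import "fmt"

// action is a hypothetical label for how the pause branch ends.
type action string

const (
	alreadyPaused action = "return early (already paused)"
	stopAndDelete action = "stop scale loop, delete HPA, return"
	fallThrough   action = "fall through (create HPA and scale loop)"
)

// proposedDecision mirrors the proposed fix: stop/delete only runs if the HPA
// still exists; otherwise the reconcile falls through to normal creation.
func proposedDecision(pausedSeen, reachedCount, hpaExists bool) action {
	scaledToPausedCount := true
	if pausedSeen {
		scaledToPausedCount = reachedCount
		if scaledToPausedCount {
			return alreadyPaused
		}
	}
	if scaledToPausedCount && hpaExists {
		return stopAndDelete
	}
	return fallThrough
}

func main() {
	// Race: Reconcile #1 still finds the HPA; Reconcile #2 reads a stale
	// Paused=False after the HPA is gone.
	fmt.Println("Reconcile #1:", proposedDecision(false, false, true))
	fmt.Println("Reconcile #2:", proposedDecision(false, false, false))
	// A ScaledObject annotated before any HPA exists has the same inputs.
	fmt.Println("Paused before HPA creation:", proposedDecision(false, false, false))
}
```

In this model, Reconcile #2 of the race and a ScaledObject that was annotated before any HPA existed present identical inputs, so both fall through and create an HPA. The later commit message describes exactly that case (the controller creating the HPA and ignoring the pause annotation) as what the merged change addresses.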

Changes

  • Added HPA existence check in reconcileScaledObject() before stop/delete operations
  • Simplified logic: check if HPA exists, if yes → stop/delete, if no → fall through
  • Falls through to normal reconcile which creates new HPA with paused replica targets


@nusmql nusmql requested a review from a team as a code owner November 4, 2025 09:47
@github-actions

github-actions bot commented Nov 4, 2025

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team November 4, 2025 09:47
@rickbrouwer
Member

Thanks for the PR. I spent some time yesterday and today investigating this issue.

I ran the E2E test suite (internals/pause_scaledobject_explicitly) about 10 times with the original code, and all runs passed.
I ran the same tests with the proposed fix, and they failed multiple times.

2025-11-05T08:45:16Z	ERROR	scaleexecutor	failed to patch Objects	{"scaledobject.Name": "pause-scaledobject-explicitly-test-so", "scaledObject.Namespace": "pause-scaledobject-explicitly-test-ns", "scaleTarget.Name": "pause-scaledobject-explicitly-test-deployment", "error": "Patch \"https://10.96.0.1:443/apis/keda.sh/v1alpha1/namespaces/pause-scaledobject-explicitly-test-ns/scaledobjects/pause-scaledobject-explicitly-test-so/status\": context canceled"}
2025-11-05T08:45:16Z	ERROR	scaleexecutor	Error updating last active time	{"scaledobject.Name": "pause-scaledobject-explicitly-test-so", "scaledObject.Namespace": "pause-scaledobject-explicitly-test-ns", "scaleTarget.Name": "pause-scaledobject-explicitly-test-deployment", "error": "Patch \"https://10.96.0.1:443/apis/keda.sh/v1alpha1/namespaces/pause-scaledobject-explicitly-test-ns/scaledobjects/pause-scaledobject-explicitly-test-so/status\": context canceled"}

I think the proposed code has a critical bug.

Regarding the original race condition: while the scenario you described is theoretically possible, I cannot reproduce it. I think the race window is extremely narrow (both reconciles need to enter the stop/delete block).

Can you share the exact steps to reproduce the race condition you encountered?
If there is an operator log that proves this happened, that would be very valuable.

@nusmql
Contributor Author

nusmql commented Nov 5, 2025

Thank you for your response. I ran the integration test before submitting the PR and just reran it in our staging environment. Please let me know if I missed any configuration.

I shared the logs in #7231:
SuccessCasse-Logs-2025-11-02 17_32_48.txt
FailedCase-Logs-2025-11-02 17_30_57.txt

I can run it again later today and add more detailed steps.

 E2E_INSTALL_KEDA=false go test -v -tags e2e ./internals/pause_scaledobject/

=== RUN   TestScaler
    pause_scaledobject_test.go:111: --- setting up ---
    helper.go:285: deleting namespace pause-scaledobject-test-ns
    helper.go:339: waiting for namespace pause-scaledobject-test-ns deletion
    helper.go:272: Creating namespace - pause-scaledobject-test-ns
    helper.go:702: Applying template: deploymentTemplate
    helper.go:702: Applying template: monitoredDeploymentTemplate
    helper.go:702: Applying template: scaledObjectAnnotatedTemplate
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 0, Target - 0
    pause_scaledobject_test.go:157: --- testing pausing at 0 ---
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 2, Target - 2
    helper.go:617: Waiting for some time to ensure deployment replica count doesn't change from 0
    helper.go:624: Deployment - pause-scaledobject-test-deployment, Current  - 0
    pause_scaledobject_test.go:168: --- testing scale out ---
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 2, Target - 2
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 0, Target - 1
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 0, Target - 1
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 1, Target - 1
    pause_scaledobject_test.go:179: --- testing pausing at N ---
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 2, Target - 0
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-monitored, Current  - 0, Target - 0
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 1, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 2, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 3, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 3, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 3, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 3, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 4, Target - 5
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 5, Target - 5
    pause_scaledobject_test.go:192: --- testing scale in ---
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 5, Target - 0
    helper.go:525: Waiting for deployment replicas to hit target. Deployment - pause-scaledobject-test-deployment, Current  - 0, Target - 0
    helper.go:771: Deleting template: scaledObjectAnnotatedTemplate
    helper.go:771: Deleting template: monitoredDeploymentTemplate
    helper.go:771: Deleting template: deploymentTemplate
    helper.go:285: deleting namespace pause-scaledobject-test-ns
    helper.go:339: waiting for namespace pause-scaledobject-test-ns deletion
    helper.go:339: waiting for namespace pause-scaledobject-test-ns deletion
    helper.go:339: waiting for namespace pause-scaledobject-test-ns deletion
--- PASS: TestScaler (96.84s)
PASS
ok      github.com/kedacore/keda/v2/tests/internals/pause_scaledobject  (cached)

@rickbrouwer
Member

rickbrouwer commented Nov 5, 2025

Please run pause_scaledobject_explicitly (you may have to run it a few more times, but you will see it fail).

@rickbrouwer
Member

I also did a very quick analysis and made a fix. I hope it solves your problem.
Could you test this branch to see if it solves it?

https://github.com/rickbrouwer/keda/tree/pull-7233

@nusmql nusmql force-pushed the fix-7231-pause-replicas-race-condition branch from c5b1248 to a1010d3 November 5, 2025 14:35
@nusmql
Contributor Author

nusmql commented Nov 5, 2025

@rickbrouwer thanks for your help.

I tested your change https://github.com/rickbrouwer/keda/tree/pull-7233 and it works.

@rickbrouwer
Member

rickbrouwer commented Nov 5, 2025

@nusmql Great! Can you fix the DCO? Then I will run the e2e tests. Further, your adjustment looks good :)

@nusmql
Contributor Author

nusmql commented Nov 5, 2025

@rickbrouwer thank you.
However, after reviewing your code, I think your solution is better; please commit it.

I think setting the paused status first is the better approach, since it blocks until the value is updated. 👍

@nusmql nusmql force-pushed the fix-7231-pause-replicas-race-condition branch from bfb4e54 to 19a971d November 5, 2025 18:55
When a ScaledObject has the paused annotation set before the HPA is
created, the controller would fall through and create the HPA, ignoring
the pause annotation.

The fix writes the paused status to etcd immediately before stopping
the scale loop or deleting the HPA. This prevents race conditions where
concurrent reconciles triggered by HPA deletion would not see the paused
status and perform redundant operations.

The key insight is to establish the paused state in etcd BEFORE any
operations that trigger new reconciles, ensuring subsequent reconciles
see the paused status and exit early.

This solution follows the approach suggested by @rickbrouwer.

Fixes kedacore#7231

Signed-off-by: nusmql <nusmql@gmail.com>
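The ordering described in this commit message can be sketched as a minimal Go model. It is not the merged KEDA code: `cluster` and `reconcilePause` are hypothetical, the early exit ignores any replica-count check, and the model assumes that once the status write returns, every later read observes it.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for what the API server has
// persisted for one ScaledObject and its HPA.
type cluster struct {
	pausedPersisted bool
	hpaExists       bool
	log             []string
}

// reconcilePause models the pause path in the described order: persist
// Paused=True first, then stop the scale loop and delete the HPA. onHPADeleted
// stands in for the watch event that enqueues another reconcile, so it can
// only fire after the status write has completed.
func reconcilePause(c *cluster, name string, onHPADeleted func()) {
	if c.pausedPersisted {
		c.log = append(c.log, name+": sees Paused=True, exits early")
		return
	}
	c.pausedPersisted = true // blocking status write, done before any deletion
	c.log = append(c.log, name+": persisted Paused=True")
	if c.hpaExists {
		c.hpaExists = false
		c.log = append(c.log, name+": stopped scale loop, deleted HPA")
		if onHPADeleted != nil {
			onHPADeleted()
		}
	}
}

func main() {
	c := &cluster{hpaExists: true}
	// The HPA deletion triggers Reconcile #2, as in the original race.
	reconcilePause(c, "Reconcile #1", func() { reconcilePause(c, "Reconcile #2", nil) })
	for _, line := range c.log {
		fmt.Println(line)
	}
}
```

Because the HPA deletion, and therefore the event that enqueues Reconcile #2, happens only after Paused=True is persisted, Reconcile #2 exits early instead of repeating the stop/delete step. With no HPA present, the model persists Paused=True and returns without creating one.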
@nusmql nusmql force-pushed the fix-7231-pause-replicas-race-condition branch from 19a971d to aa91a59 November 5, 2025 18:58
@rickbrouwer
Member

rickbrouwer commented Nov 5, 2025

/run-e2e internals
Update: You can check the progress here

@rickbrouwer
Member

rickbrouwer commented Nov 5, 2025

/run-e2e internals
Update: You can check the progress here

@rickbrouwer
Member

rickbrouwer commented Nov 5, 2025

/run-e2e internals
Update: You can check the progress here

@rickbrouwer
Member

started 3 tests, just to be sure 🙂

@rickbrouwer rickbrouwer added the merge-conflict This PR has a merge conflict label Nov 18, 2025
@rickbrouwer
Member

hi @nusmql

Your branch has conflicts that must be resolved. Could you look into that?

@keda-automation keda-automation requested a review from a team November 19, 2025 10:14
@snyk-io

snyk-io bot commented Nov 19, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@rickbrouwer rickbrouwer removed the merge-conflict This PR has a merge conflict label Nov 19, 2025
@rickbrouwer
Member

rickbrouwer commented Nov 19, 2025

/run-e2e internals
Update: You can check the progress here

@JorTurFer JorTurFer merged commit 63ebfb8 into kedacore:main Nov 19, 2025
25 checks passed
@JorTurFer
Member

Thanks a lot!

@nusmql nusmql deleted the fix-7231-pause-replicas-race-condition branch November 19, 2025 18:23
@JorTurFer JorTurFer mentioned this pull request Dec 7, 2025
31 tasks
JorTurFer pushed a commit to JorTurFer/keda that referenced this pull request Dec 8, 2025
JorTurFer pushed a commit to JorTurFer/keda that referenced this pull request Dec 8, 2025
JorTurFer pushed a commit to JorTurFer/keda that referenced this pull request Dec 8, 2025
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>
JorTurFer added a commit that referenced this pull request Dec 8, 2025
* fix: Correct parse error ActiveMQ (#7245)

Signed-off-by: Rick Brouwer <rickbrouwer@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* fix: metricUnavailableValue parameter not working in Datadog scaler (#7241)

* fix: metricUnavailableValue parameter not working in Datadog scaler

The UseFiller flag was not being set correctly when metricUnavailableValue
was configured. This fix distinguishes between 'not configured' and
'explicitly set to 0' by checking TriggerMetadata directly.

Changes:
- Set UseFiller in validateAPIMetadata() when metricUnavailableValue exists
- Set UseFiller in validateClusterAgentMetadata() when metricUnavailableValue exists
- Remove UseFiller logic from Validate() (responsibility moved to validate functions)
- Update tests to verify UseFiller behavior with various values including 0

This allows users to explicitly set metricUnavailableValue to 0 and have
it work as a fallback value, while still erroring when not configured.

Fixes #7238

Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>

* test: cover both API and ClusterAgent modes in UseFiller test

Updated TestDatadogMetadataValidateUseFiller to test both validateAPIMetadata()
and validateClusterAgentMetadata() code paths. This ensures that the UseFiller
flag is correctly set in both integration modes.

Test cases now cover:
- API mode: 5 test cases (not configured, 0, positive, negative, decimal)
- Cluster Agent mode: 5 test cases (same variations)

Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>

* refactor: use pointer type for FillValue to avoid TriggerMetadata access

Changed FillValue from float64 to *float64 to distinguish between
'not configured' (nil) and 'explicitly set to any value including 0'.

This addresses reviewer feedback about avoiding direct TriggerMetadata
access and improves type safety and refactoring resistance.

Changes:
- FillValue type changed from float64 to *float64 with optional tag
- validateAPIMetadata checks nil instead of TriggerMetadata map
- validateClusterAgentMetadata checks nil instead of TriggerMetadata map
- Dereference FillValue when returning fallback value (2 locations)
- Update tests to handle pointer type with proper nil checks

Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>

---------

Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* Fix ScaledObject pause behavior when HPA doesn't exist (#7233)

When a ScaledObject has the paused annotation set before the HPA is
created, the controller would fall through and create the HPA, ignoring
the pause annotation.

The fix writes the paused status to etcd immediately before stopping
the scale loop or deleting the HPA. This prevents race conditions where
concurrent reconciles triggered by HPA deletion would not see the paused
status and perform redundant operations.

The key insight is to establish the paused state in etcd BEFORE any
operations that trigger new reconciles, ensuring subsequent reconciles
see the paused status and exit early.

This solution follows the approach suggested by @rickbrouwer.

Fixes #7231

Signed-off-by: nusmql <nusmql@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* fix: use TriggerError when all ScaledJob triggers fail (#7205)

Signed-off-by: Rick Brouwer <rickbrouwer@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* Fix transfer-hpa-ownership panic when hpa name not provided (#7260)

* chore: renormalize line endings

Signed-off-by: James Williams <jamesleighwilliams@gmail.com>

* fix: nil pointer when transfer-hpa-ownership is true but hpa name not specified (#7254)

Signed-off-by: James Williams <jamesleighwilliams@gmail.com>

* update changelog

Signed-off-by: James Williams <jamesleighwilliams@gmail.com>

* revert vendor changes

Signed-off-by: James Williams <jamesleighwilliams@gmail.com>

---------

Signed-off-by: James Williams <jamesleighwilliams@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* fix: restore HPA behavior when paused-scale-in/out annotation is deleted (#7291)

When paused-scale-in or paused-scale-out annotation is deleted (not set
to "false") and the corresponding selectPolicy (scaleDown.selectPolicy
or scaleUp.selectPolicy) is not explicitly set in the ScaledObject spec,
the HPA's SelectPolicy remains stuck at "Disabled" instead of being
restored.

This occurs even if other behavior fields like policies or
stabilizationWindowSeconds are defined - only an explicit selectPolicy
value triggers the update.

Root cause: DeepDerivative treats nil as "unset" and considers it a
subset of any value, so DeepDerivative(nil, Disabled) returns true,
preventing the HPA update.

Fix: Add explicit DeepEqual check for Behavior field, following the
existing pattern used for Metrics length check.

test: add e2e test for paused-scale-in annotation removal

Signed-off-by: Dima Shevchuk <dshedimon@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* refactor: remove unused scaledObjectMetricSpecs variable (#7292)

* refactor: remove unused scaledObjectMetricSpecs variable

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>

* update CHANGELOG.md

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>

---------

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* fix: handle requestScaleLoop error in ScaledObject controller (#7273)

* fix: handle requestScaleLoop error in ScaledObject controller

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>

* chore: update CHANGELOG for PR #7273

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>

---------

Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* bump actions and go version (#7295)

* bump actions and go version

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* bump deps

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* update pkgs

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* update tools

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* .

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* fix test

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* fix lint

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* update setup-go to use go.mod version

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* add nolint to exclude pulsar issues

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* fix devenv

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* fix codeql

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* fix splunk test

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* include job in links

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* update to ubuntu-slim some runners

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>

* Update apis/keda/v1alpha1/scaledobject_webhook_test.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

* Update .github/workflows/scorecards.yml

Co-authored-by: Jan Wozniak <wozniak.jan@gmail.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

---------

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jan Wozniak <wozniak.jan@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

* update changelog

Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>

---------

Signed-off-by: Rick Brouwer <rickbrouwer@gmail.com>
Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz>
Signed-off-by: Hiroki Matsui <fenethtool@gmail.com>
Signed-off-by: nusmql <nusmql@gmail.com>
Signed-off-by: James Williams <jamesleighwilliams@gmail.com>
Signed-off-by: Dima Shevchuk <dshedimon@gmail.com>
Signed-off-by: u-kai <76635578+u-kai@users.noreply.github.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
Co-authored-by: Rick Brouwer <rickbrouwer@gmail.com>
Co-authored-by: Matchan <fenethtool@gmail.com>
Co-authored-by: nusmql <nusmql@gmail.com>
Co-authored-by: James Williams <jamesleighwilliams@gmail.com>
Co-authored-by: Dima Shevchuk <dshedimon@gmail.com>
Co-authored-by: Kai Udo <76635578+u-kai@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jan Wozniak <wozniak.jan@gmail.com>
alt-dima pushed a commit to alt-dima/keda that referenced this pull request Dec 13, 2025
Signed-off-by: Dmitriy Altuhov <altuhovd@gmail.com>
tangobango5 pushed a commit to tangobango5/keda that referenced this pull request Dec 22, 2025
tangobango5 pushed a commit to tangobango5/keda that referenced this pull request Feb 13, 2026


Development

Successfully merging this pull request may close these issues.

Race condition in paused-replicas annotation causes ScaledObject to get stuck in inconsistent state

3 participants