
feat(provider): pause redis-operator reconciliation during StatefulSet scale-to-zero #963

Merged
acouvreur merged 3 commits into sablierapp:main from jlaska:feat/redis-operator-statefulset-skip-reconcile on Jun 5, 2026

Conversation

@jlaska jlaska commented Jun 4, 2026

Contributor

Closes #962

What this does

I run a homelab where several applications sit behind Traefik with Sablier managing scale-to-zero. After the recent CloudNativePG support landed I started enabling it across my stack, and noticed Redis instances managed by the OT-CONTAINER-KIT redis-operator were the odd one out — Sablier would scale the StatefulSet to zero on session expiry, but the operator would immediately reconcile replicas back to 1.

The operator already ships a pause mechanism: redis.opstreelabs.in/skip-reconcile: "true" on the Redis CR. Setting this before scaling to zero keeps the operator out of the way, and clearing it after scale-up hands control back.

  • stop → set skip-reconcile: "true" on the owning Redis CR, then scale the StatefulSet to 0
  • start → scale the StatefulSet back to 1, then remove the annotation so the operator resumes normal reconciliation
  • If the scale fails, a cleanup defer removes the annotation so the operator is never left paused with pods still running.
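
The ordering above can be sketched in a few lines of Go. This is a minimal illustration, not the code from this PR: `redisPauser`, `scaler`, and the two fakes are hypothetical stand-ins for the Kubernetes calls, and only the sequencing (annotate, scale, clean up on failure) is taken from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// redisPauser and scaler are hypothetical stand-ins for the Kubernetes
// calls involved; they are not Sablier's actual interfaces.
type redisPauser interface {
	SetSkipReconcile(paused bool) error
}

type scaler interface {
	Scale(replicas int32) error
}

// stop pauses the operator, then scales to zero. If the scale fails, the
// deferred cleanup clears the annotation so the operator is not left paused.
func stop(p redisPauser, s scaler) (err error) {
	if perr := p.SetSkipReconcile(true); perr != nil {
		// Best-effort: warn and continue with the scale.
		fmt.Println("warning: could not set skip-reconcile:", perr)
	} else {
		defer func() {
			if err != nil {
				_ = p.SetSkipReconcile(false)
			}
		}()
	}
	return s.Scale(0)
}

// start scales back up first, then hands control back to the operator.
func start(p redisPauser, s scaler) error {
	if err := s.Scale(1); err != nil {
		return err
	}
	if err := p.SetSkipReconcile(false); err != nil {
		fmt.Println("warning: could not clear skip-reconcile:", err)
	}
	return nil
}

// fakeCR and fakeSTS are in-memory fakes used only for this demonstration.
type fakeCR struct{ paused bool }

func (f *fakeCR) SetSkipReconcile(paused bool) error {
	f.paused = paused
	return nil
}

type fakeSTS struct {
	replicas int32
	fail     bool
}

func (f *fakeSTS) Scale(replicas int32) error {
	if f.fail {
		return errors.New("scale failed")
	}
	f.replicas = replicas
	return nil
}

func main() {
	cr, sts := &fakeCR{}, &fakeSTS{replicas: 1}
	_ = stop(cr, sts)
	fmt.Println(cr.paused, sts.replicas) // true 0
	_ = start(cr, sts)
	fmt.Println(cr.paused, sts.replicas) // false 1
}
```

The named return value `err` is what lets the deferred function see whether the scale succeeded.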

Design choices

  • No new kind. This hooks into the existing StatefulSet path rather than introducing a redis kind. The operator propagates labels (including sablier.enable/sablier.group) from the Redis CR to the StatefulSet it creates, so existing label-based opt-in works without any change to how users configure Sablier.
  • Only on scale-to-zero. The annotation is not set in scale-mode stops (sablier.idle.replicas >= 1), where the StatefulSet stays non-zero and the operator should continue reconciling normally.
  • Version-agnostic owner detection. The owner reference check matches on API group only (redis.redis.opstreelabs.in), not on the specific version string, so it stays correct if the operator promotes its API from v1beta2 to v1.
  • Best-effort. If the dynamic client is not configured or the Redis CR cannot be fetched, a warning is logged and the scale proceeds unchanged — existing behaviour for non-redis-operator StatefulSets is completely unaffected.
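
As an illustration of the version-agnostic owner check, here is a minimal sketch. The function names `redisOperatorOwner` and `apiVersionGroup` come from this PR, but the signatures below are assumptions, `ownerRef` is a local stand-in for `metav1.OwnerReference` so the example compiles without Kubernetes dependencies, and the extra `Kind == "Redis"` check is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ownerRef mirrors the metav1.OwnerReference fields used here. It is a local
// stand-in so the sketch builds without k8s.io/apimachinery.
type ownerRef struct {
	APIVersion string
	Kind       string
	Name       string
	Controller *bool
}

const redisOperatorGroup = "redis.redis.opstreelabs.in"

// apiVersionGroup returns the group part of "group/version". Core-group
// versions such as "v1" carry no group and yield "".
func apiVersionGroup(apiVersion string) string {
	group, _, found := strings.Cut(apiVersion, "/")
	if !found {
		return ""
	}
	return group
}

// redisOperatorOwner returns the name of the controlling Redis CR, if any.
// Only the API group is compared, never the version. The Kind check is an
// assumption of this sketch.
func redisOperatorOwner(refs []ownerRef) (string, bool) {
	for _, ref := range refs {
		if ref.Controller == nil || !*ref.Controller {
			continue
		}
		if ref.Kind == "Redis" && apiVersionGroup(ref.APIVersion) == redisOperatorGroup {
			return ref.Name, true
		}
	}
	return "", false
}

func main() {
	isController := true
	refs := []ownerRef{{
		APIVersion: "redis.redis.opstreelabs.in/v1beta2",
		Kind:       "Redis",
		Name:       "cache",
		Controller: &isController,
	}}
	name, ok := redisOperatorOwner(refs)
	fmt.Println(name, ok) // cache true

	// A future v1 API still matches because only the group is compared.
	refs[0].APIVersion = "redis.redis.opstreelabs.in/v1"
	name, ok = redisOperatorOwner(refs)
	fmt.Println(name, ok) // cache true
}
```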

Testing done

  • Unit tests (statefulset_redis_operator_test.go) covering owner detection (including forward-compatibility with a future v1 APIVersion), annotation set/clear, no-op for plain StatefulSets, stop annotation set with cleanup-on-failure, and start annotation removal.
  • make fmt, golangci-lint run, and the full pkg/provider/kubernetes test suite pass.
  • Real cluster end-to-end: session expiry scales the StatefulSet to 0 with skip-reconcile: "true" set; the operator logs the annotation and stands down. A new request scales back to 1 and the annotation is removed. The redis-operator resumes normally.

@jlaska jlaska requested a review from acouvreur as a code owner June 4, 2026 19:59
@github-actions github-actions Bot added documentation Improvements or additions to documentation provider Issue related to a provider labels Jun 4, 2026
jlaska added 2 commits June 4, 2026 16:01
…Set scale-to-zero

The OT-CONTAINER-KIT redis-operator continuously reconciles its managed
StatefulSets back to the desired replica count. When Sablier scales a
redis-operator-owned StatefulSet to zero the operator immediately restores
the replica count, making scale-to-zero ineffective.

Before scaling a StatefulSet to zero, check whether it is controlled by a
Redis CR (via ownerReferences). If so, set the redis.opstreelabs.in/skip-reconcile
annotation on the owning Redis CR to pause the operator's reconciliation loop.
After scaling back up, remove the annotation to restore normal operation.

The annotation patch is best-effort: if it fails (e.g. the CRD is absent or
the dynamic client is unconfigured) a warning is logged and the scale proceeds
unchanged, preserving existing behaviour for non-redis-operator StatefulSets.
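
One way to express the annotation change is a JSON merge patch, where a string value sets the key and `null` removes it. The sketch below only builds the patch bodies. Whether the PR uses a merge patch is an assumption, and `skipReconcilePatch` is a hypothetical helper name. With a dynamic client, a body like this would typically be sent with `types.MergePatchType`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const skipReconcileAnnotation = "redis.opstreelabs.in/skip-reconcile"

// skipReconcilePatch builds a JSON merge patch body for the Redis CR.
// paused=true sets the annotation to "true"; paused=false emits null,
// which removes the key under merge-patch semantics.
func skipReconcilePatch(paused bool) ([]byte, error) {
	var value any // nil marshals to JSON null
	if paused {
		value = "true"
	}
	return json.Marshal(map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]any{skipReconcileAnnotation: value},
		},
	})
}

func main() {
	set, _ := skipReconcilePatch(true)
	unset, _ := skipReconcilePatch(false)
	fmt.Println(string(set))
	// {"metadata":{"annotations":{"redis.opstreelabs.in/skip-reconcile":"true"}}}
	fmt.Println(string(unset))
	// {"metadata":{"annotations":{"redis.opstreelabs.in/skip-reconcile":null}}}
}
```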

fix(kubernetes): address code review findings on redis-operator skip-reconcile

Correctness:
- Annotation no longer leaks when scale fails: a cleanup defer is registered
  immediately after setting skip-reconcile=true, and clears it if p.scale()
  returns an error.
- Annotation is only set on the scale-to-zero path, not on scale-mode stops
  (sc.Idle.Replicas >= 1), so the operator is not paused while the StatefulSet
  remains non-zero.
- Deferred annotation removal in InstanceStart now uses context.WithoutCancel
  so a cancelled request context cannot leave skip-reconcile set after a
  successful scale-up.
- Both InstanceStop and InstanceStart log a warning when the StatefulSet fetch
  fails rather than silently skipping annotation management.

Quality:
- redisOperatorOwner matches on API group only (not the full version string),
  so the fix continues to work if the operator promotes from v1beta2 to v1.
  The now-redundant redisOperatorAPIVersion constant is removed.
- The redundant StatefulSet GET on the stop path is eliminated by moving the
  annotation block to after getWorkloadLabels (which already fetches the
  StatefulSet), placing it immediately before p.scale(ctx, parsed, 0).
- New unit tests in statefulset_redis_operator_test.go cover redisOperatorOwner
  (including forward-compat with v1), apiVersionGroup, setRedisOperatorSkipReconcile
  set/clear, no-op for plain StatefulSets, stop annotation set/cleanup-on-failure,
  and start annotation removal after successful scale.
@jlaska jlaska force-pushed the feat/redis-operator-statefulset-skip-reconcile branch from 0a93b92 to 72e5a43 Compare June 4, 2026 20:01
Comment thread pkg/provider/kubernetes/statefulset_redis_operator_test.go
Installs a minimal Redis CRD into the shared k3s cluster (following the
same pattern as the CNPG integration test) and creates a Redis CR plus a
companion StatefulSet whose ownerReference points to that CR, simulating
what the redis-operator would produce.

The three sub-tests verify Sablier's behavior against a real API server:
- stop: InstanceStop sets skip-reconcile on the Redis CR and scales the
  StatefulSet to 0
- inspect: InstanceInspect correctly reports the stopped state
- start: InstanceStart scales back to 1 and clears the annotation

No operator binary is run — the tests validate Sablier's own API
interactions with real Kubernetes objects, not operator behavior.
@jlaska jlaska force-pushed the feat/redis-operator-statefulset-skip-reconcile branch from 646fae1 to 57bd7c4 Compare June 5, 2026 00:46
jlaska commented Jun 5, 2026

Contributor Author

@acouvreur I added statefulset_redis_operator_integration_test.go following the same pattern as the CNPG test — it installs a minimal Redis CRD into the shared k3s cluster and creates a Redis CR alongside a StatefulSet whose ownerReference points to it, simulating what the operator produces.

The three sub-tests verify Sablier's behavior against a real API server:

  • stop: InstanceStop sets skip-reconcile: "true" on the Redis CR and scales the StatefulSet to 0
  • inspect: InstanceInspect correctly reports the stopped state
  • start: InstanceStart scales back to 1 and clears the annotation

No operator binary is deployed — the tests validate Sablier's own API interactions with real Kubernetes objects, not operator behavior.

Let me know if there are additional changes you'd like.

@acouvreur acouvreur left a comment

Member

LGTM

Thank you for the pull request

@acouvreur acouvreur merged commit f3f1af0 into sablierapp:main Jun 5, 2026
3 of 4 checks passed
@sablier-bot sablier-bot Bot mentioned this pull request Jun 5, 2026

Labels

documentation Improvements or additions to documentation provider Issue related to a provider

Development

Successfully merging this pull request may close these issues.

OT-CONTAINER-KIT redis-operator reconciles StatefulSet replicas back to 1 after Sablier scales to zero
