[Operator] Add L2 RESP (Redis/Valkey) adapter support#2967

Merged
royyhuang merged 9 commits into LMCache:dev from royyhuang:feat/opertor-redis-intergration
Apr 11, 2026

Conversation

@royyhuang
Contributor

@royyhuang royyhuang commented Apr 6, 2026

Summary

  • Add first-class Redis/Valkey L2 adapter support to the LMCache operator with typed CRD fields, cross-namespace secret management, and store/prefetch policy configuration
  • Switch DaemonSet entrypoint from python3 -m to lmcache server CLI, expose both ZMQ and HTTP ports in Service
  • Fix reconciler resourceVersion conflicts by re-fetching before status updates

Details

L2 Backend

  • New l2Backend field (singular, single adapter for now) replaces generic l2Backends list
  • RESPL2AdapterSpec: host, port, numWorkers, maxCapacityGB, authSecretRef
  • RawL2AdapterSpec: escape hatch for nixl_store, fs, mock, etc.
  • storePolicy / prefetchPolicy / prefetchMaxInFlight for L2 pipeline control
  • CRD validation: exactly-one-of constraint, RESP field validation
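
As a rough illustration of the exactly-one-of constraint, the Go sketch below models the spec with hypothetical stand-in types. Only the field names listed above come from this PR; the Go types, the `resp`/`raw` member names, the port range check, and the error messages are assumptions, and the real validation may live in CRD markers rather than in code like this.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins: only the field names are taken from the PR text.
type SecretReference struct {
	Name      string
	Namespace string // optional
}

type RESPL2AdapterSpec struct {
	Host          string
	Port          int
	NumWorkers    int
	MaxCapacityGB int // assumed integer; the real type is not stated
	AuthSecretRef *SecretReference
}

type RawL2AdapterSpec struct {
	Type   string            // e.g. "nixl_store", "fs", "mock"
	Config map[string]string // assumed free-form settings
}

type L2BackendSpec struct {
	RESP *RESPL2AdapterSpec // assumed member name
	Raw  *RawL2AdapterSpec  // assumed member name
}

// validateL2Backend enforces that exactly one adapter is set and applies
// basic checks to the RESP fields.
func validateL2Backend(b L2BackendSpec) error {
	if (b.RESP == nil) == (b.Raw == nil) {
		return errors.New("l2Backend: exactly one of resp or raw must be set")
	}
	if b.RESP != nil {
		if b.RESP.Host == "" {
			return errors.New("l2Backend.resp.host is required")
		}
		if b.RESP.Port < 1 || b.RESP.Port > 65535 {
			return fmt.Errorf("l2Backend.resp.port %d is out of range", b.RESP.Port)
		}
	}
	return nil
}

func main() {
	ok := L2BackendSpec{RESP: &RESPL2AdapterSpec{Host: "redis.example.svc", Port: 6379}}
	fmt.Println(validateL2Backend(ok))              // valid: one adapter set
	fmt.Println(validateL2Backend(L2BackendSpec{})) // invalid: none set
}
```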

Cross-Namespace Auth Secrets

  • SecretReference with optional namespace field
  • Controller reads source secret → creates managed copy in engine namespace with ownerRef
  • DaemonSet uses secretKeyRef on local copy (credentials never in pod args or kubectl describe)
  • Username is optional (Optional: true) for password-only Redis Enterprise auth
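
The copy flow can be sketched with plain Go types standing in for the Kubernetes objects. This is not the controller's code: the `Secret` struct, the `-resp-auth` name suffix, and the fallback to the engine namespace when `namespace` is omitted are all assumptions made for illustration. The required `password` key and the optional username follow the PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the CRD type and corev1.Secret.
type SecretReference struct {
	Name      string
	Namespace string // optional
}

type Secret struct {
	Namespace string
	Name      string
	Owner     string // stands in for the ownerRef to the engine
	Data      map[string][]byte
}

// sourceNamespace assumes an omitted namespace means the engine namespace.
func sourceNamespace(ref SecretReference, engineNS string) string {
	if ref.Namespace == "" {
		return engineNS
	}
	return ref.Namespace
}

// managedCopy builds the local copy that the DaemonSet would reference.
// "password" is required; "username" is carried over only if present.
func managedCopy(src Secret, engineNS, engineName string) (Secret, error) {
	if _, ok := src.Data["password"]; !ok {
		return Secret{}, errors.New("source secret has no 'password' key")
	}
	data := make(map[string][]byte, len(src.Data))
	for k, v := range src.Data {
		data[k] = append([]byte(nil), v...) // copy so the source is never aliased
	}
	return Secret{
		Namespace: engineNS,
		Name:      engineName + "-resp-auth", // hypothetical naming scheme
		Owner:     engineName,
		Data:      data,
	}, nil
}

func main() {
	ref := SecretReference{Name: "redis-auth", Namespace: "redis-system"}
	src := Secret{
		Namespace: sourceNamespace(ref, "inference"),
		Name:      ref.Name,
		Data:      map[string][]byte{"password": []byte("example")},
	}
	local, err := managedCopy(src, "inference", "my-engine")
	fmt.Println(local.Namespace, local.Name, local.Owner, err)
}
```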

Server & Service

  • Entrypoint: /opt/venv/bin/lmcache server instead of python3 -m lmcache.v1.multiprocess.server
  • httpPort field (default 8080) for HTTP frontend (health checks, admin API)
  • Lookup Service exposes both server (ZMQ) and http ports

Reconciler Fixes

  • Re-fetch engine before Status().Update() to avoid conflicts from Owns watch events
  • Return done=true after finalizer addition to prevent stale resourceVersion
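
To show why the re-fetch matters, the sketch below simulates optimistic concurrency with an in-memory store instead of a real API server. The `Engine` and `store` types are hypothetical; the point is only that an update carrying an old resourceVersion is rejected, while an update built from a freshly fetched object succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: resourceVersion is stale")

// Engine is a hypothetical, minimal stand-in for the custom resource.
type Engine struct {
	ResourceVersion int
	Phase           string
}

// store mimics the API server's optimistic concurrency check.
type store struct{ current Engine }

func (s *store) Get() Engine { return s.current }

func (s *store) Update(e Engine) error {
	if e.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	e.ResourceVersion++
	s.current = e
	return nil
}

// updateStatus re-fetches before writing, so it always starts from the
// latest resourceVersion.
func updateStatus(s *store, phase string) error {
	fresh := s.Get()
	fresh.Phase = phase
	return s.Update(fresh)
}

func main() {
	s := &store{current: Engine{ResourceVersion: 1}}
	stale := s.Get()

	// Another write (for example, one triggered by a watch event) lands first.
	_ = s.Update(Engine{ResourceVersion: 1, Phase: "Pending"})

	stale.Phase = "Ready"
	fmt.Println(s.Update(stale))          // conflict: resourceVersion is stale
	fmt.Println(updateStatus(s, "Ready")) // <nil>
}
```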

Buildkite CI

  • is-operator-only.sh: detects operator-only PRs, steps report green without running tests
  • Applied to all pipelines (unit, e2e, integration, comprehensive, multiprocess, blend, correctness)
  • operator/* added as safe path in should-run-comprehensive.sh
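
The actual check is a shell script whose contents are not shown here. As a rough Go rendering of the idea, the sketch below treats a PR as operator-only when every changed path sits under `operator/`. The function name, the handling of an empty change list, and the non-operator example path are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// isOperatorOnly reports whether every changed file lives under operator/.
// An empty change list is treated as "not operator-only" so tests still run.
func isOperatorOnly(changed []string) bool {
	if len(changed) == 0 {
		return false
	}
	for _, p := range changed {
		if !strings.HasPrefix(p, "operator/") {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isOperatorOnly([]string{
		"operator/internal/controller/reconcile_helpers.go",
		"operator/internal/resources/compute.go",
	}))
	fmt.Println(isOperatorOnly([]string{
		"operator/internal/resources/helpers.go",
		"lmcache/example.py", // hypothetical non-operator path
	}))
}
```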

Test plan

  • make test — all unit tests pass (api, controller, resources)
  • make lint — 0 issues
  • Deployed with make run against live cluster with Redis Enterprise
  • Verified --l2-adapter args, env var injection, cross-namespace secret copy
  • Verified health check via HTTP port-forward

Note: Currently only a single L2 adapter is supported at a time. LMCache MP mode is designed to support multiple adapters in cascade, but this is not yet fully tested. Once validated, the operator will support multiple adapters.


Note

Medium Risk
Introduces a CRD shape change (l2Backends → l2Backend) and new secret-handling logic (cross-namespace copy + env injection), which can break existing manifests and affects credential management. Also changes the LMCache DaemonSet entrypoint and exposed Service ports, so rollout should be validated against running deployments.

Overview
Adds a new singular spec.l2Backend with typed RESP (Redis/Valkey) configuration, a raw escape hatch for other adapters, and L2 store/prefetch policy flags; updates validation/tests and regenerates CRD/deepcopy accordingly.

The controller now manages RESP auth via Secrets (including cross-namespace source → managed local copy) and injects credentials as env vars; RBAC and controller ownership are updated to reconcile Secret resources.

Switches the LMCache DaemonSet to run /opt/venv/bin/lmcache server, adds server.httpPort (default 8080), and exposes both ZMQ and HTTP ports on the lookup Service; docs and sample manifests are updated, and status/finalizer handling is adjusted to avoid resourceVersion conflicts.

Reviewed by Cursor Bugbot for commit 3882f1e.

Add first-class support for Redis/Valkey as an L2 storage backend in the
LMCache operator, with cross-namespace secret management and Buildkite CI
skip logic for operator-only PRs.

L2 Backend:
- Replace generic l2Backends list with typed l2Backend (single adapter)
- Add RESPL2AdapterSpec with host, port, numWorkers, maxCapacityGB, authSecretRef
- Add RawL2AdapterSpec as escape hatch for other adapter types (nixl_store, fs, mock)
- Add storePolicy, prefetchPolicy, prefetchMaxInFlight to L2BackendSpec
- CRD validation for RESP fields and exactly-one-of constraint

Auth Secret Management:
- Cross-namespace SecretReference (name + optional namespace)
- Controller copies source secret to engine namespace as managed secret with ownerRef
- DaemonSet references local copy via secretKeyRef (credentials never in pod args)
- Username is optional (Optional: true) for password-only Redis Enterprise auth

Server:
- Switch entrypoint from python3 -m to lmcache server CLI
- Add httpPort to ServerSpec (default 8080) for HTTP frontend
- Expose both ZMQ and HTTP ports in lookup Service and DaemonSet

Reconciler Fixes:
- Re-fetch engine before status update to avoid resourceVersion conflicts
- Return done=true after finalizer addition to prevent stale object usage

Buildkite CI:
- Add is-operator-only.sh script to skip Python/CUDA tests for operator-only PRs
- Steps still trigger and report green so GitHub required checks pass
- Add operator/ as safe path in should-run-comprehensive.sh

Docs:
- Update README with L2 Redis examples, raw escape hatch, single-adapter caveat
- Add /dev/shm warning for both LMCache and vLLM pods
- Add full reference sample CR with all fields commented out
- Update DESIGN.md L2 section

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a test-skipping optimization for operator-only changes and refactors the LMCache operator to support a structured L2 backend configuration, specifically adding native Redis/Valkey (RESP) support with secure credential management. Key changes include the introduction of cross-namespace secret reconciliation, an updated container entrypoint, and improved controller stability through resource re-fetching. Review feedback recommends hardening the secret reconciliation logic by validating required keys and ensuring controller references are correctly applied during patches, alongside better error handling for JSON serialization.

Comment thread operator/internal/controller/reconcile_helpers.go
Comment thread operator/internal/controller/reconcile_helpers.go Outdated
Comment thread operator/internal/resources/compute.go Outdated
Comment thread operator/internal/controller/lmcacheengine_controller.go
Comment thread operator/internal/resources/helpers.go Outdated
- Validate source secret has required 'password' key before copying
- Set ownerRef on existing secret during patch (not just on desired)
- Handle json.Marshal errors in L2 adapter JSON builders
- Add Owns(&corev1.Secret{}) watch so secret changes trigger reconcile
- Remove unused NeedsCrossNamespaceSecret function
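
For the `json.Marshal` item, a builder that propagates the error instead of discarding it might look like the sketch below. The function name and the JSON keys (`type`, `base_path`) are hypothetical; the PR does not show the adapter JSON format. The second call in `main` uses a channel value only because it is a type `json.Marshal` cannot encode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildRawAdapterJSON merges the adapter type with free-form settings and
// returns the marshal error to the caller rather than ignoring it.
func buildRawAdapterJSON(adapterType string, config map[string]any) (string, error) {
	payload := map[string]any{"type": adapterType}
	for k, v := range config {
		payload[k] = v
	}
	b, err := json.Marshal(payload)
	if err != nil {
		return "", fmt.Errorf("marshal %s adapter config: %w", adapterType, err)
	}
	return string(b), nil
}

func main() {
	s, err := buildRawAdapterJSON("fs", map[string]any{"base_path": "/data"})
	fmt.Println(s, err)

	// A channel cannot be encoded as JSON, so this call returns an error.
	_, err = buildRawAdapterJSON("mock", map[string]any{"bad": make(chan int)})
	fmt.Println(err != nil)
}
```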

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 64c6f17.

patch := client.MergeFrom(existing.DeepCopy())
existing.Data = desired.Data
existing.Labels = desired.Labels
return r.Patch(ctx, existing, patch)
OwnerRef change lost due to wrong DeepCopy ordering

Medium Severity

In reconcileRESPAuthSecret, SetControllerReference is called on existing before client.MergeFrom(existing.DeepCopy()). The DeepCopy captures the already-mutated state (with ownerRef), so the merge patch only includes Data and Labels changes — the ownerRef mutation is invisible to the diff and never persisted. All other reconcile methods in this file avoid this by setting the ownerRef on desired instead of existing. The DeepCopy call needs to happen before SetControllerReference for the ownerRef to be included in the patch.
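
The ordering problem can be reproduced without Kubernetes. In the sketch below, `Secret`, `DeepCopy`, and `changedFields` are simplified, hypothetical stand-ins: `changedFields` plays the role of the merge-patch diff by comparing a base snapshot against the mutated object. Taking the snapshot after the ownerRef is set hides that change from the diff.

```go
package main

import "fmt"

// Secret is a hypothetical, minimal stand-in for the object being patched.
type Secret struct {
	OwnerRefs []string
	Labels    map[string]string
}

func (s Secret) DeepCopy() Secret {
	c := Secret{
		OwnerRefs: append([]string(nil), s.OwnerRefs...),
		Labels:    map[string]string{},
	}
	for k, v := range s.Labels {
		c.Labels[k] = v
	}
	return c
}

// changedFields lists the fields that differ between the snapshot and the
// mutated object. Comparing printed forms is enough for this toy example.
func changedFields(base, modified Secret) []string {
	var out []string
	if fmt.Sprint(base.OwnerRefs) != fmt.Sprint(modified.OwnerRefs) {
		out = append(out, "ownerReferences")
	}
	if fmt.Sprint(base.Labels) != fmt.Sprint(modified.Labels) {
		out = append(out, "labels")
	}
	return out
}

func main() {
	// Buggy order: set the owner first, then take the snapshot.
	existing := Secret{Labels: map[string]string{}}
	existing.OwnerRefs = append(existing.OwnerRefs, "engine")
	base := existing.DeepCopy()
	existing.Labels["app"] = "lmcache"
	fmt.Println(changedFields(base, existing)) // [labels]

	// Fixed order: snapshot first, then apply every mutation.
	existing = Secret{Labels: map[string]string{}}
	base = existing.DeepCopy()
	existing.OwnerRefs = append(existing.OwnerRefs, "engine")
	existing.Labels["app"] = "lmcache"
	fmt.Println(changedFields(base, existing)) // [ownerReferences labels]
}
```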


Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Contributor

@sammshen sammshen left a comment

LGTM!

@royyhuang
Contributor Author

@ruizhang0101 PTAL.

Contributor

@ruizhang0101 ruizhang0101 left a comment

Can this support multiple L2 adapters?

EDIT: It is documented that this PR supports only a single L2 adapter, but I am wondering whether we can support multiple adapters in this PR as well, or whether that will be added in a follow-up PR soon, since this would be really useful for multi-tenancy.

PS: Maybe not relevant to this PR, but if there are multiple L2 adapters, it would be great to have an ID for each adapter in both LMCache and the operator to differentiate them.

@royyhuang
Contributor Author

I feel it would be better to support this in another PR, since I am not sure whether multi-adapter has been thoroughly tested and validated. Some questions, like the one you brought up about adding adapter IDs, would need to be addressed first, so that is definitely worth another PR.

@royyhuang royyhuang added the full Run comprehensive tests on this PR label Apr 7, 2026
Contributor

@ApostaC ApostaC left a comment

LGTM!

@royyhuang royyhuang enabled auto-merge (squash) April 7, 2026 22:36
@royyhuang royyhuang merged commit 456aa16 into LMCache:dev Apr 11, 2026
27 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
* [Operator] Add L2 RESP (Redis/Valkey) adapter support

* [Operator] Address review feedback from Gemini and Cursor bots

* [Operator] Revert Buildkite CI changes (will be in separate PR)

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

---------

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
Labels

full Run comprehensive tests on this PR

4 participants