External Kafka & Redis
This guide covers how to point the Rulebricks Helm chart at your own Kafka and Redis infrastructure instead of using the bundled instances. It is written for platform engineers deploying Rulebricks into a self-hosted Kubernetes environment.
Externalizing the Supabase stack itself is not covered here. The Supabase section at the end of this document explains the rationale and what can be externalized instead.
How Rulebricks Uses These Services
For the overall component layout and request flow, see Architecture. The part that matters here: HPS (the rule execution server) uses a correlation-ID request/response pattern over Kafka. When a solve request arrives:
- HPS produces a message to the solution Kafka topic with a unique correlation ID and a designated response partition.
- A worker pod consumes the message, evaluates the rule, and produces the result to the solution-response topic on the exact partition the originating HPS replica is listening on.
- HPS resolves the pending request and returns the result to the caller.
The idempotent Kafka producer has the broker deduplicate retried sends, so each request message is written exactly once per producer session, and the correlation-ID mechanism ensures responses always route back to the correct HPS replica. This is why partition counts matter so much on an external cluster.
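The request/response flow above can be illustrated with a small in-memory simulation. This is only a sketch of the pattern, not HPS code: `InMemoryBroker`, `HpsReplica`, `worker_step`, and the message field names are hypothetical, and a single `solution` partition is assumed for brevity.

```python
import uuid
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for Kafka: topic -> partition -> FIFO queue of messages."""

    def __init__(self):
        self.topics = defaultdict(lambda: defaultdict(deque))

    def produce(self, topic, partition, message):
        self.topics[topic][partition].append(message)

    def poll(self, topic, partition):
        queue = self.topics[topic][partition]
        return queue.popleft() if queue else None


class HpsReplica:
    """Hypothetical replica that owns one partition of solution-response."""

    def __init__(self, broker, owned_partition):
        self.broker = broker
        self.owned_partition = owned_partition
        self.pending = {}  # correlation ID -> result (None while in flight)

    def submit(self, payload):
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = None
        self.broker.produce("solution", 0, {
            "correlation_id": correlation_id,
            "response_partition": self.owned_partition,
            "payload": payload,
        })
        return correlation_id

    def drain_responses(self):
        # Only the partition this replica owns is ever read.
        while (msg := self.broker.poll("solution-response", self.owned_partition)) is not None:
            if msg["correlation_id"] in self.pending:
                self.pending[msg["correlation_id"]] = msg["result"]


def worker_step(broker, evaluate):
    """Consume one request and reply on the partition named in the request."""
    msg = broker.poll("solution", 0)
    if msg is None:
        return False
    broker.produce("solution-response", msg["response_partition"], {
        "correlation_id": msg["correlation_id"],
        "result": evaluate(msg["payload"]),
    })
    return True
```

With two replicas owning partitions 3 and 7, one worker can serve both requests and each replica still receives only its own result, because the reply partition travels with the request.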
Redis sits in front of Supabase as a shared cache layer. API key authentication, rule/flow definitions, and named-environment lookups are all cached in Redis with short TTLs (60 to 180 seconds), backed by an in-process LRU per pod. In practice, Supabase sees very few direct queries under normal operation.
Externalizing Kafka
How Rulebricks Uses Kafka
Rulebricks uses three Kafka topics. Topic names carry the configured rulebricks.app.logging.kafkaTopicPrefix (default com.rulebricks., so the actual topic is com.rulebricks.solution and so on). The prefix exists so Rulebricks topics don't collide on shared or managed clusters; set it to "" to disable prefixing.
| Topic | Purpose | Producers | Consumers |
|---|---|---|---|
| solution | Inbound solve requests | HPS API pods | Worker pods (generic-workers consumer group) |
| solution-response | Outbound results routed back to originating HPS replica | Worker pods | HPS API pods (hps-response-consumer consumer group) |
| logs | Structured decision logs | HPS and app pods | Vector aggregator (vector-consumers consumer group) |
The logs topic is only truly optional if both vector.enabled: false and rulebricks.app.logging.enabled: false. In the default chart configuration the Vector pod is deployed and consumes this topic, so it needs to exist on your cluster. See Vector and the logs topic below.
Messages are plain JSON: no schema registry, no Kafka Connect, no custom broker plugins. The client library is KafkaJS, which speaks the standard Kafka wire protocol.
Helm Values for External Kafka
To switch from the bundled Kafka to your own cluster, set the following in your values override:
```yaml
kafka:
  enabled: false
rulebricks:
  app:
    logging:
      enabled: true
      kafkaBrokers: 'broker-1.example.com:9092,broker-2.example.com:9092'
```

The kafkaBrokers value drives every consumer in the stack:
- HPS, workers, and the main app receive it as the KAFKA_BROKERS environment variable via the shared ConfigMap. When kafkaBrokers is empty (the default), the chart auto-generates the internal cluster address <release>-kafka.<namespace>.svc.cluster.local:9092.
- Vector is wired automatically through a templated vector-kafka-env ConfigMap that derives its bootstrap servers, TLS/SASL settings, and the prefixed log topic from the same rulebricks.app.logging.* values. You do not configure Vector's Kafka connection by hand. (For token-auth mechanisms, the ConfigMap points Vector at the local kafkaBridge proxy instead; see Authentication.)
- KEDA's worker-scaling trigger also reads kafkaBrokers and points at your external cluster automatically (see KEDA Autoscaling with External Kafka).
Authentication (SSL/SASL)
The chart supports TLS and SASL for external Kafka through rulebricks.app.logging.kafkaSsl and kafkaSasl, which it exposes to the app, HPS, workers, and Vector. Supported SASL mechanisms are plain, scram-sha-256, scram-sha-512, and aws-iam.
Two consumers connect to Kafka with different auth capabilities: HPS (KafkaJS, which handles all supported mechanisms including token-based ones) and Vector (which natively supports SSL plus SASL PLAIN/SCRAM, but not token mechanisms like IAM or OAUTHBEARER). For token mechanisms, the chart provides a kafkaBridge kafka-proxy sidecar that authenticates upstream using the Vector pod's workload identity and exposes a local plaintext listener Vector consumes from. The CLI configures all of this automatically when you externalize Kafka; the examples below are for hand-installs.
AWS MSK with IAM auth (credentials from pod identity via IRSA, no static secrets):
```yaml
kafka:
  enabled: false
rulebricks:
  app:
    logging:
      enabled: true
      kafkaBrokers: 'b-1.msk.example:9098,b-2.msk.example:9098'
      kafkaSsl: true
      kafkaSasl:
        mechanism: 'aws-iam'
        region: 'us-east-1'
  hps:
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: 'arn:aws:iam::ACCOUNT:role/msk-access'
# Vector cannot speak MSK IAM directly, so it uses the kafka-proxy bridge:
kafkaBridge:
  enabled: true
  provider: 'aws'
  region: 'us-east-1'
  brokers: 'b-1.msk.example:9098,b-2.msk.example:9098'
  awsRoleArn: 'arn:aws:iam::ACCOUNT:role/msk-access'
vector:
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: 'arn:aws:iam::ACCOUNT:role/msk-access'
```

Azure Event Hubs (Kafka endpoint) uses SASL PLAIN with the namespace connection string. Both HPS and Vector connect directly; no bridge sidecar required:
```yaml
kafka:
  enabled: false
rulebricks:
  app:
    logging:
      enabled: true
      kafkaBrokers: 'my-namespace.servicebus.windows.net:9093'
      kafkaSsl: true
      kafkaSasl:
        mechanism: 'plain'
        username: '$ConnectionString'
        password: 'Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...'
        # or: existingSecret / existingSecretUsernameKey / existingSecretPasswordKey
```

GCP Managed Service for Apache Kafka prefers OAUTHBEARER with Workload Identity; the Vector bridge runs GCP's local auth-token server sidecar (kafkaBridge.provider: "gcp" with gcpServiceAccountEmail). A simpler plain/SCRAM credential also works for both consumers but uses static credentials.
Topics to Pre-Create
The HPS producer sets allowAutoTopicCreation: true, but auto-created topics inherit the broker's default partition count, often just 1. A single-partition solution-response will cause request timeouts under any meaningful replica count. Always pre-create topics explicitly.
In-cluster chart deployments handle all of this automatically via kafka.provisioning plus the kafka-topic-align Job; this section applies to external/managed Kafka, where topics are customer-managed.
Ask your Kafka team to create (names take your configured topic prefix):
| Topic | Partition Count | Replication | Retention |
|---|---|---|---|
| solution | ~2x your maximum worker replica count (e.g. 48 for a 24-worker ceiling) | 1 is acceptable (transient RPC traffic) | Short: retention.ms=300000, segment.ms=300000, small segments |
| solution-response | Must equal hps.workers.solutionPartitions | Same as solution | Same as solution |
| logs | 8-24 depending on volume | 2-3 in production (long-lived data) | Size to your Vector outage tolerance, e.g. retention.ms=86400000 |
Set max.message.bytes to at least 2097152 (2 MB) on all three topics. HPS chunks are byte-bounded well below this, but the headroom prevents edge-case produce failures, and a response chunk can exceed its request chunk when rules expand payloads.
Partition count on solution is the worker fleet's concurrency ceiling, not a worker quota; the 2x headroom recommendation and the full sizing model are explained in Performance & Scaling.
For replication, 1 is acceptable on the RPC topics because the traffic is transient and the HPS producer uses acks=-1, so higher replication adds ISR-wait latency to every request. Use 2+ only if broker loss must never produce a brief window of request failures.
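As a planning aid, the sizing rules above can be turned into a small script that prints `kafka-topics.sh` commands for your Kafka team. This is a sketch under assumptions: `TopicSpec` and `plan_topics` are hypothetical helpers, the logs defaults (8 partitions, replication 3) are simply picked from the ranges in the table, no segment size is set because the text does not give one, and some managed services require their own tooling instead of `kafka-topics.sh`.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSpec:
    name: str
    partitions: int
    replication_factor: int
    configs: dict = field(default_factory=dict)

    def create_command(self, bootstrap_server):
        """Render a kafka-topics.sh --create invocation for this topic."""
        parts = [
            "kafka-topics.sh", "--bootstrap-server", bootstrap_server,
            "--create", "--topic", self.name,
            "--partitions", str(self.partitions),
            "--replication-factor", str(self.replication_factor),
        ]
        for key, value in self.configs.items():
            parts += ["--config", f"{key}={value}"]
        return " ".join(parts)


def plan_topics(max_worker_replicas, solution_partitions,
                logs_partitions=8, logs_replication=3,
                prefix="com.rulebricks."):
    """Apply the table's rules: 2x workers, solutionPartitions, short RPC retention."""
    rpc = {"retention.ms": 300000, "segment.ms": 300000, "max.message.bytes": 2097152}
    logs = {"retention.ms": 86400000, "max.message.bytes": 2097152}
    return [
        TopicSpec(f"{prefix}solution", 2 * max_worker_replicas, 1, dict(rpc)),
        TopicSpec(f"{prefix}solution-response", solution_partitions, 1, dict(rpc)),
        TopicSpec(f"{prefix}logs", logs_partitions, logs_replication, dict(logs)),
    ]


if __name__ == "__main__":
    for topic in plan_topics(max_worker_replicas=24, solution_partitions=64):
        print(topic.create_command("broker-1.example.com:9092"))
```

Passing `prefix=""` produces unprefixed topic names, matching a chart configured with an empty kafkaTopicPrefix.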
Partition Sizing for solution-response
The solution-response topic is partition-sensitive. HPS replicas share a consumer group over this topic, so each replica is assigned a subset of partitions. When producing a response, the worker writes to the exact partition the originating HPS replica is consuming. If the topic has fewer partitions than expected, responses land on partitions no replica is watching, causing 30-second timeouts.
The expected partition count is set in the Helm values:
```yaml
rulebricks:
  hps:
    workers:
      solutionPartitions: 64
```

The HPS deployment template passes solutionPartitions to HPS as the MAX_WORKERS environment variable.
Keep solutionPartitions equal to the actual partition count of solution-response on your external cluster. A mismatch here is the most common cause of 30-second timeouts on solve requests.
If you raise a topic's partition count later, HPS Kafka clients observe the change within about 30 seconds (metadataMaxAge: 30000), and consumers fully rebalance onto new partitions at their next group rebalance. Chart upgrades roll the workers, which forces one. HPS pods gate readiness on the response consumer owning partitions (GET /ready returns 503 until group join completes), so traffic never reaches an instance that can't receive its responses.
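A quick way to reason about these numbers before a rollout is sketched below. Both helpers are hypothetical, and the even spread across replicas is an assumption about the consumer group assignor rather than something the chart guarantees.

```python
def partitions_per_replica(partition_count, replica_count):
    """Spread partitions as evenly as possible across HPS replicas."""
    if replica_count <= 0:
        raise ValueError("replica_count must be positive")
    base, extra = divmod(partition_count, replica_count)
    return [base + 1 if i < extra else base for i in range(replica_count)]


def check_solution_partitions(configured, actual):
    """Compare solutionPartitions (Helm) with the topic's real partition count."""
    if configured == actual:
        return "ok"
    return (f"mismatch: solutionPartitions={configured} but the topic has "
            f"{actual} partitions; expect 30-second timeouts until they match")
```

For a 64-partition topic and three replicas, `partitions_per_replica(64, 3)` gives `[22, 21, 21]`, so every replica owns at least one partition as long as the replica count does not exceed the partition count.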
ZooKeeper vs KRaft
HPS does not care how your Kafka cluster manages metadata. It never connects to ZooKeeper directly; it only speaks the Kafka wire protocol to brokers via a bootstrap address. Both ZooKeeper-backed clusters and KRaft-mode clusters work identically. Managed services like AWS MSK (in either mode), Confluent Cloud, Aiven, and Redpanda are all compatible.
The bundled Kafka subchart ships in KRaft mode (kraft.enabled: true, zookeeper.enabled: false), but this is a deployment choice for the internal broker and has no bearing on external cluster compatibility.
Tuning and Idempotency
The HPS Kafka client is pre-tuned for a low-latency request/response workload. These settings are baked into the application and do not need external configuration, but are worth understanding:
Producer:
- Idempotent mode (idempotent: true, acks: -1) guarantees exactly-once produce semantics per session. If a network blip causes a retry, Kafka deduplicates the message at the broker. This is Kafka-level idempotency, not HTTP-level: a client retrying an HTTP request produces a new message with a new correlation ID.
- Snappy compression reduces wire bytes. The broker must support Snappy (enabled by default on Apache Kafka and Confluent). If your cluster has disabled Snappy, contact Rulebricks support.
- lingerMs: 0 sends immediately. Latency is prioritized over batch throughput because these are synchronous, user-facing requests.
Consumer:
- The HPS response consumer uses sessionTimeout: 60000, heartbeatInterval: 15000, tolerating brief idle periods without spurious rebalancing.
- Workers use a tighter 30s session timeout with a 3s heartbeat (tunable via WORKER_SESSION_TIMEOUT_MS / WORKER_HEARTBEAT_INTERVAL_MS) so a hung worker releases its partitions quickly.
- maxWaitTimeInMs: 50 makes the broker return fetched messages quickly for low-latency response delivery.
- Workers process up to 2 partitions concurrently per pod. Throughput scales by adding worker replicas, not by raising concurrency.
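No tuning is needed on the external Kafka side beyond ensuring the three topics exist with correct partition counts and reasonable ISR settings (e.g., min.insync.replicas=1 for single-broker, 2 for multi-broker production). Keep min.insync.replicas at or below each topic's replication factor: the HPS producer uses acks=-1, so a topic with replication factor 1 and min.insync.replicas=2 rejects every produce.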
Request Size Limits
Admission is byte-first; item counts are not limited by default. Useful when setting client expectations:
- Total request body: 6 MiB hard ceiling by default (HTTP 413 above it; configurable via HTTP_BODY_LIMIT_BYTES on HPS). We recommend clients stay around 1 MB or less for the best latency profile. Larger bodies execute fine but cost proportionally more parse time and fan out into more chunks.
- Single payload: 1.5 MiB serialized hard ceiling (HTTP 413 naming the offending array index; configurable via ITEM_MAX_BYTES, which must stay below the topics' max.message.bytes minus envelope headroom).
- Item count: unlimited by default. Operators can opt into a cap with BULK_MAX_ITEMS (HTTP 400 above it).
- Response amplification: each chunk's response must fit the topic's max.message.bytes (2 MB default). Rules that expand outputs beyond roughly 16 KB average per payload will fail the request with an explicit error.
- The whole request shares one 30-second execution deadline regardless of size.
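Clients that send bulk requests may want to check the default limits before calling the API. The following is a minimal sketch, assuming the request body is a JSON array of payloads and that compact JSON serialization approximates how HPS measures size; `preflight` is a hypothetical helper, and the constants mirror the defaults stated above.

```python
import json

MIB = 1024 * 1024
BODY_LIMIT_BYTES = 6 * MIB         # default HTTP_BODY_LIMIT_BYTES per the list above
ITEM_LIMIT_BYTES = int(1.5 * MIB)  # default ITEM_MAX_BYTES per the list above


def serialized_size(value):
    """Size in bytes of the compact JSON encoding (an approximation)."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def preflight(items, body_limit=BODY_LIMIT_BYTES, item_limit=ITEM_LIMIT_BYTES):
    """Return a list of problems; an empty list means the request fits the defaults."""
    problems = []
    for index, item in enumerate(items):
        size = serialized_size(item)
        if size > item_limit:
            problems.append(f"item {index} is {size} bytes (limit {item_limit})")
    body_size = serialized_size(items)
    if body_size > body_limit:
        problems.append(f"body is {body_size} bytes (limit {body_limit})")
    return problems
```

A non-empty result corresponds to the cases where HPS would answer with HTTP 413, so the client can split the batch before sending it.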
KEDA Autoscaling with External Kafka
The KEDA ScaledObject for HPS workers monitors consumer lag on the solution topic. When you externalize Kafka, KEDA's bootstrapServers is automatically derived from your kafkaBrokers value. No additional KEDA configuration is needed.
However, ensure KEDA can reach your external Kafka brokers from within the cluster (network policies, VPC peering, security groups, etc.). If your Kafka requires SASL, KEDA's Kafka trigger also needs authentication; see the KEDA Kafka trigger docs for details.
Vector and the logs topic
The Rulebricks chart ships a Vector pod by default as the consumer of the logs topic. HPS and the main app produce structured decision-log entries to Kafka after each request completes (non-blocking, post-response); Vector reads them and forwards to whatever sink the chart is configured for, commonly the object storage archive that ClickHouse queries, S3, or an HTTP endpoint into a SIEM.
A few things to know when externalizing Kafka:
- Consumer group. Vector joins Kafka as vector-consumers (configured in vector.customConfig.sources.kafka.group_id). This is the group ID to use in kafka-consumer-groups.sh commands and any ACL rules.
- Consume-only. In the default chart configuration Vector only reads from the logs topic; it does not produce back to Kafka. If your cluster enforces ACLs, the Vector principal needs Read and Describe on logs with group id vector-consumers, and no Write.
- Network reachability. Vector must be able to reach the external brokers from inside the cluster, the same as HPS and the workers. Check network policies, security groups, and VPC peering accordingly.
To verify Vector is healthy after pointing it at external Kafka:
```bash
kubectl get pods | grep vector
kubectl logs <vector-pod> | grep -i "kafka\|partition"
kafka-consumer-groups.sh --bootstrap-server <broker> \
  --describe --group vector-consumers
```

The pod should be Running, the logs should show a successful Kafka connection and partition assignment at startup, and consumer-group lag on logs should stay near zero under normal load. Growing lag almost always means Vector cannot reach the brokers or is missing ACL permissions.
Disabling Kafka Entirely
Setting kafka.enabled: false without providing external kafkaBrokers effectively disables HPS. All solve endpoints return HTTP 503. Only the /health endpoint continues to respond. This is only useful for running a control-plane-only deployment (dashboard and admin APIs without rule execution).
It also breaks the decision-log pipeline: Vector loses its data source, so any downstream sink (ClickHouse, S3, SIEM) will see a complete gap for the duration of the Kafka outage. See Vector and the logs topic for the consumer-side details.
Externalizing Redis
How Rulebricks Uses Redis
Redis serves as a shared cache between all Rulebricks components: the main app, HPS, and workers. It sits in a three-tier caching hierarchy:
- L1: In-process LRU (per pod, always present)
- L2: Redis (shared across all pods and replicas)
- L3: Supabase (source of truth, queried on full cache miss)
What lives in Redis:
| Data | TTL | Written By |
|---|---|---|
| API key auth payloads | 60s | HPS |
| Rule/flow definitions (compressed) | 180s | HPS, workers |
| Named-environment release mappings | No expiry | Main app (not HPS) |
| Flow node results (API, SOAP, DB, vault) | 1–300s | Workers |
Redis is accessed using only basic commands: GET, SET with EX, EXPIRE, and DEL. No Lua scripts, no pub/sub, no streams. A vanilla Redis instance is all that's needed, with no clustering, persistence configuration, or eviction policy requirements.
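The three-tier lookup can be sketched as a read-through cache. This is an illustration only: `TtlStore` stands in for Redis using just GET and SET with EX, `LocalLru` and `lookup` are hypothetical, and the per-pod LRU here has no TTL of its own, which is a simplification.

```python
import time
from collections import OrderedDict


class TtlStore:
    """Stand-in for Redis supporting only GET and SET with an expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)


class LocalLru:
    """Per-pod in-process LRU (L1)."""

    def __init__(self, capacity=1024):
        self._items = OrderedDict()
        self._capacity = capacity

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used


def lookup(key, l1, l2, load_from_db, ttl_seconds):
    """Return (value, tier) where tier names the layer that answered."""
    value = l1.get(key)
    if value is not None:
        return value, "L1"
    value = l2.get(key)
    if value is not None:
        l1.put(key, value)
        return value, "L2"
    value = load_from_db(key)  # L3: source of truth
    l2.set(key, value, ex=ttl_seconds)
    l1.put(key, value)
    return value, "L3"
```

Two pods sharing one `TtlStore` show why the database sees little traffic: the first lookup reaches L3, the same pod then answers from L1, and a second pod answers from L2 without touching the database.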
What You Provide
To use your own Redis instance, you provide one thing: a Redis host and port.
```yaml
rulebricks:
  redis:
    enabled: false
    external:
      host: 'your-redis.example.com'
      port: 6379
      password: 'your-password'
      # existingSecret: "my-redis-secret"
      # existingSecretKey: "redis-password"
      tls:
        enabled: false
```

Setting redis.enabled: false stops the chart from deploying its own internal Redis pod and PVC. The chart takes care of everything else; all internal components are wired to your Redis automatically.
What the Chart Does Behind the Scenes
The Rulebricks stack has two types of Redis consumers:
- HPS and workers connect to Redis directly via the native Redis protocol (ioredis). This is the fast path (~1–2ms per operation) and is configured via the REDIS_URL environment variable, which the chart constructs from your external.host and external.port.
- The main app (Next.js) uses an HTTP-based Redis client (@vercel/kv). It cannot connect to Redis natively, so it needs an HTTP translation layer.
To bridge this, the chart always deploys a lightweight internal proxy (serverless-redis-http) that speaks HTTP on one side and the Redis protocol on the other. When you externalize Redis, this proxy is automatically pointed at your external instance. You do not need to configure or think about it; it's an internal implementation detail.
In short: you provide a standard Redis endpoint, and the chart handles routing each component to it through the appropriate protocol.
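To make the REDIS_URL idea concrete, here is a sketch of how such a URL is conventionally assembled from a host, port, optional password, and TLS flag. The exact string the chart generates is not shown in this guide, so treat `build_redis_url` as a hypothetical illustration of the common `redis://` and `rediss://` URL schemes rather than the chart's template.

```python
from urllib.parse import quote


def build_redis_url(host, port=6379, password=None, tls=False):
    """Assemble a Redis connection URL; rediss:// is the conventional TLS scheme."""
    scheme = "rediss" if tls else "redis"
    # Percent-encode the password so characters like '@' or '/' stay unambiguous.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"
```

This can help when testing connectivity by hand, for example by passing the resulting URL to a Redis client from a debug pod in the same namespace.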
Connection Examples
| Provider | host | port | tls.enabled |
|---|---|---|---|
| AWS ElastiCache (no TLS) | primary.my-cluster.use1.cache.amazonaws.com | 6379 | false |
| AWS ElastiCache (TLS) | primary.my-cluster.use1.cache.amazonaws.com | 6379 | true |
| AWS MemoryDB | clustercfg.my-cluster.use1.memorydb.amazonaws.com | 6379 | true |
| Redis Cloud | redis-12345.c1.us-east-1.redns.redis-cloud.com | 12345 | true |
| GCP Memorystore | 10.x.x.x | 6379 | false |
| Self-hosted (another namespace) | redis-svc.other-ns.svc.cluster.local | 6379 | false |
The Redis client is pre-configured with TCP keepalive, auto-pipelining, and exponential-backoff retries. No client-side tuning is typically required.
Disabling Redis Entirely
If redis.enabled is false and no external host is provided, Redis operates in a no-op mode where cache writes are silently discarded and reads return empty. HPS will start and log a warning, but the consequences are significant:
- Every authentication and entity lookup hits Supabase directly. Expect a ~60x increase in database queries at steady state.
- Named-environment URLs stop working. Requests like /api/v1/solve/my-rule/prod return 404 because the releases_* cache keys are only written by the main app into Redis. Numeric versions and latest still work.
- Flow-node caching is lost. API, SOAP, and database nodes inside flows execute on every invocation.
Supabase
The Supabase stack itself stays in-cluster (or on Supabase Cloud); what you can externalize is the PostgreSQL server underneath it. Set supabase.db.enabled: false and configure supabase.externalDatabase.* to point the Supabase services and migration jobs at a server you manage. See External PostgreSQL for the values.
The full self-hosted Supabase stack includes the following services:
- PostgreSQL stores users, teams, rules, flows, API keys, usage data, and all application state.
- GoTrue (Auth) handles user signup, login, password recovery, email verification, SSO/OIDC, and JWT issuance. The main app delegates all authentication to Supabase Auth.
- PostgREST provides the REST API layer over PostgreSQL that the application queries.
- Realtime powers live-update features in the dashboard.
- Kong is the API gateway that routes and authenticates requests to the above services.
These services stay chart-managed even with an external database. Beyond providing a clear database layer for Rulebricks, Supabase is our interface for authentication, JWT management, and real-time features. There are database triggers between Supabase-managed identity tables (e.g., auth.users) and Rulebricks-managed application tables that depend on Supabase's internal wiring, so the external server must be a vanilla PostgreSQL that the Supabase services can fully manage.
It is worth noting the database is likely not a bottleneck. Because Redis aggressively caches entity definitions, and every pod has an in-process LRU in front of Redis, actual database read/write volumes under normal operation are generally minimal.
Database Backups
While the Supabase PostgreSQL instance is lightweight, it holds all application state and should be backed up regularly if self-hosting. The chart manages this for you with scheduled Barman backups to shared object storage, plus on-demand backup and restore through the CLI. See Storage & Backups.
Verification Checklist
After switching to external Kafka or Redis, verify the deployment is healthy:
1. HPS health endpoints
HPS exposes two endpoints: /health (process-alive, used by the liveness probe) and /ready (used by the readiness probe; returns 503 until the Kafka response consumer has joined its group and owns partitions, so pods never receive traffic before they can route responses).
```bash
kubectl port-forward svc/<release>-hps 3000:3000
curl -s http://localhost:3000/health | jq
curl -si http://localhost:3000/ready | head -1
```

Expected from /health:
```json
{
  "status": "ok",
  "redis": true,
  "kafka": {
    "enabled": true,
    "ready": true,
    "partitions": 21
  }
}
```

- redis: true confirms the KV client connected. If false, check your REDIS_URL or redis.external.* values.
- kafka.ready: true confirms the consumer joined its group. If false on startup, retry after 10 to 15 seconds; group join can take a moment. /ready returns 200 once this completes.
- kafka.partitions shows how many solution-response partitions this pod owns. In a 3-replica HPS against a 64-partition topic, each pod should report ~21.
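The checks on the /health payload can be automated in a post-deploy script. The sketch below assumes the payload shape shown above; `check_health` and its messages are hypothetical, and the fair-share comparison assumes partitions are spread evenly across replicas.

```python
import json


def check_health(payload, total_partitions=None, replica_count=None):
    """Return a list of problems found in a parsed /health response."""
    problems = []
    if payload.get("status") != "ok":
        problems.append("status is not ok")
    if payload.get("redis") is not True:
        problems.append("redis not connected: check REDIS_URL or redis.external.*")
    kafka = payload.get("kafka") or {}
    if not kafka.get("enabled"):
        problems.append("kafka disabled: solve endpoints return 503")
    elif not kafka.get("ready"):
        problems.append("kafka consumer has not joined its group yet; retry shortly")
    elif total_partitions and replica_count:
        fair_share = total_partitions // replica_count
        owned = kafka.get("partitions", 0)
        if owned < fair_share:
            problems.append(f"pod owns {owned} partitions, expected about {fair_share}")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '{"status": "ok", "redis": true,'
        ' "kafka": {"enabled": true, "ready": true, "partitions": 21}}'
    )
    print(check_health(sample, total_partitions=64, replica_count=3) or "healthy")
```

Feeding it the output of `curl -s http://localhost:3000/health` for each HPS pod gives a quick pass/fail signal before running the smoke test.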
2. End-to-end smoke test
```bash
curl -X POST http://localhost:3000/api/v1/solve/<rule-slug>/latest \
  -H "x-api-key: $API_KEY" \
  -H "content-type: application/json" \
  -d '{"input": "value"}'
```

A 200 response confirms Kafka, Redis, Supabase, and the worker pipeline are all healthy.
3. Common failure modes
| Symptom | Likely Cause | Fix |
|---|---|---|
| All solve requests return 503 | Kafka disabled or unreachable | Check kafkaBrokers, verify network connectivity to brokers |
| Requests time out after 30s | solution-response partition count ≠ solutionPartitions | Raise the topic's partition count, or update solutionPartitions to match |
| Named-environment URLs return 404 | Redis not connected, or main app and HPS point at different Redis instances | Verify redis: true on /health; ensure shared Redis |
| Persistent consumer lag on solution | Not enough worker replicas | Scale up hps.workers.replicas or hps.workers.keda.minReplicaCount |
| Decision logs stop reaching ClickHouse / S3 / SIEM after externalizing Kafka | Vector can't reach or authenticate to the external broker | Check Vector pod logs and network reachability; for token-auth Kafka verify the kafkaBridge sidecar is enabled |