feat(controller): support HA deployment #1133
Conversation
Pull request overview
This PR enables high availability (HA) deployments of the kagent controller by supporting multiple replicas with leader election and shared state via PostgreSQL. The changes refactor configuration management to use environment variables (allowing Secrets for sensitive database credentials), add leader election RBAC resources, and restructure A2A API registration to work across all replicas rather than only in the reconciliation loop.
- Introduces environment variable-based configuration with the `LoadFromEnv` function (sketched below), enabling Secret-based database URL configuration
- Adds Helm validation preventing SQLite use with multiple controller replicas, and auto-enables leader election when replicas > 1
- Refactors A2A handler registration from reconciler-based to cache informer-based via `A2ARegistrar`, ensuring A2A APIs are available on all replicas regardless of leader election status
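For illustration, the env-var loading pattern could look roughly like the following sketch. The `KAGENT_` prefix and the flag-to-variable naming scheme are assumptions for illustration, not necessarily what the PR uses:

```go
package app

import (
	"flag"
	"os"
	"strings"
)

// LoadFromEnv fills in any flag that was not set on the command line from a
// matching environment variable, so CLI args and env vars compose cleanly.
func LoadFromEnv(fs *flag.FlagSet) error {
	setOnCLI := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { setOnCLI[f.Name] = true }) // Visit only walks flags that were set

	var err error
	fs.VisitAll(func(f *flag.Flag) {
		if err != nil || setOnCLI[f.Name] {
			return // explicit CLI arguments take precedence
		}
		// e.g. -database-url -> KAGENT_DATABASE_URL (naming scheme assumed)
		key := "KAGENT_" + strings.ToUpper(strings.ReplaceAll(f.Name, "-", "_"))
		if v, ok := os.LookupEnv(key); ok {
			err = fs.Set(f.Name, v)
		}
	})
	return err
}
```

Because `flag.FlagSet.Set` routes through each flag's own parser, the same code path handles string, bool, duration, and quantity flag types, which lines up with the test coverage described for `app_test.go`.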
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| helm/kagent/values.yaml | Adds envFrom field to controller configuration and fixes indentation in volume/volumeMount comment examples |
| helm/kagent/templates/rbac/leader-election-role.yaml | New Role granting permissions for leader election (leases and events) |
| helm/kagent/templates/rbac/leader-election-rolebinding.yaml | New RoleBinding associating leader election Role with controller ServiceAccount |
| helm/kagent/templates/controller-deployment.yaml | Refactors to use ConfigMap-based environment variables, conditionalizes volumes/volumeMounts, and removes inline command args |
| helm/kagent/templates/controller-configmap.yaml | New ConfigMap consolidating all controller configuration as environment variables, including leader election flag based on replica count |
| helm/kagent/templates/_helpers.tpl | Adds validation preventing SQLite use with multiple replicas |
| go/pkg/app/app_test.go | Comprehensive tests for LoadFromEnv function covering string, bool, duration, and quantity flag types |
| go/pkg/app/app.go | Implements LoadFromEnv to load configuration from environment variables and restructures A2A registration to use new registrar |
| go/internal/controller/translator/agent/adk_api_translator.go | Extracts agent card building logic to buildAgentCard method and adds TranslateAgentCard interface method for use by A2A registrar |
| go/internal/controller/reconciler/reconciler.go | Removes A2A reconciler dependency as A2A registration is now handled by separate registrar component |
| go/internal/controller/a2a/a2a_reconciler.go | Deleted file - reconciler-based A2A registration replaced with cache-based registrar |
| go/internal/a2a/a2a_registrar.go | New registrar component using cache informer to register/update/remove A2A handlers on all replicas |
| contrib/addons/postgres.yaml | Adds PostgreSQL deployment for testing HA controller setup with shared database |
| Makefile | Adds Postgres to addon installation and removes outdated Istio reference from comments |
Comments suppressed due to low confidence (1)
helm/kagent/templates/controller-deployment.yaml:35
- The sqlite-volume is defined unconditionally (lines 32-34) even when `database.type` is "postgres". This will create an unused volume. The volume definition should also be conditionally included based on the database type.
Suggestion: Wrap the sqlite-volume definition with `{{- if eq .Values.database.type "sqlite" }}` and update the outer condition on line 30 to be: `{{- if or (eq .Values.database.type "sqlite") (gt (len .Values.controller.volumes) 0) }}`

```yaml
- name: sqlite-volume
  emptyDir:
    sizeLimit: 500Mi
    medium: Memory
```
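Applied together, the suggested template would look roughly like this (a sketch assembled from the suggestion above; surrounding deployment lines elided):

```yaml
{{- if or (eq .Values.database.type "sqlite") (gt (len .Values.controller.volumes) 0) }}
      volumes:
        {{- if eq .Values.database.type "sqlite" }}
        - name: sqlite-volume
          emptyDir:
            sizeLimit: 500Mi
            medium: Memory
        {{- end }}
{{- end }}
```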
This PR ensures that the log level specified when installing kagent is respected by the controller. Currently it is not respected because when `Development: true` is set, controller-runtime's zap logger enables verbosity level (`V(1)`) logs by default, which get logged as [`log.V(1).Info(...)`](https://github.com/kagent-dev/kagent/blob/9438c9c0f2c79daf632555df1d7d3cb2d04b7b81/go/internal/httpserver/handlers/health.go#L21) but displayed as `DEBUG` in the output. `-zap-log-level` controls the minimum log level (info, error, etc.), but does NOT control the verbosity threshold for V(n) logs.

Note:

- Switching away from development logging by default also switches the default log format from `ConsoleEncoder` to `JsonEncoder` (see examples below)[^1].

Pre this PR:

```
2025-11-26T07:07:20Z INFO Starting KAgent Controller {"version": "v0.7.5-1-g9438c9c0", "git_commit": "9438c9c0", "build_date": "2025-11-26"}
2025-11-26T07:07:20Z INFO Config {"config": {"Metrics":{"Addr":"0","CertPath":"","CertName":"tls.crt","CertKey":"tls.key"},"Streaming":{"MaxBufSize":"1Mi","InitialBufSize":"4Ki","Timeout":600000000000},"LeaderElection":false,"ProbeAddr":":8082","SecureMetrics":true,"EnableHTTP2":false,"DefaultModelConfig":{"Namespace":"kagent","Name":"default-model-config"},"HttpServerAddr":":8083","WatchNamespaces":"","A2ABaseUrl":"http://127.0.0.1:8083","Database":{"Type":"sqlite","Path":"/sqlite-volume/kagent.db","Url":"postgres://postgres:kagent@db.kagent.svc.cluster.local:5432/crud"}}}
2025-11-26T07:07:20Z INFO Updating GOMAXPROCS=2: determined from CPU quota
2025-11-26T07:07:20Z INFO setup Starting KAgent Controller {"version": "v0.7.5-1-g9438c9c0", "git_commit": "9438c9c0", "build_date": "2025-11-26"}
2025-11-26T07:07:20Z INFO setup Watching all namespaces (no valid namespaces specified)
2025-11-26T07:07:20Z INFO MCPServer CRD not found - controller will not be started {"controller": "mcpserver"}
2025-11-26T07:07:20Z INFO Could not find CRD for tool server - API integration will be disabled {"toolServerType": "MCPServer.kagent.dev"}
2025-11-26T07:07:20Z INFO setup starting manager
2025-11-26T07:07:20Z INFO starting server {"name": "health probe", "addr": "[::]:8082"}
2025-11-26T07:07:20Z INFO http-server Starting HTTP server {"address": ":8083"}
2025-11-26T07:07:20Z INFO Starting EventSource {"controller": "service", "controllerGroup": "", "controllerKind": "Service", "source": "kind source: *v1.Service"}
2025-11-26T07:07:20Z INFO Starting Controller {"controller": "service", "controllerGroup": "", "controllerKind": "Service"}
2025-11-26T07:07:20Z INFO Starting workers {"controller": "service", "controllerGroup": "", "controllerKind": "Service", "worker count": 1}
2025-11-26T07:07:20Z INFO Starting EventSource {"controller": "modelconfig", "controllerGroup": "kagent.dev", "controllerKind": "ModelConfig", "source": "kind source: *v1.Secret"}
2025-11-26T07:07:20Z INFO Starting EventSource {"controller": "agent", "controllerGroup": "kagent.dev", "controllerKind": "Agent", "source": "kind source: *v1.Service"}
2025-11-26T07:07:20Z INFO setup starting pprof server
2025-11-26T07:07:20Z INFO setup pprof server started {"address": ":6060"}
...
2025-11-26T07:07:20Z INFO Starting Controller {"controller": "agent", "controllerGroup": "kagent.dev", "controllerKind": "Agent"}
2025-11-26T07:07:20Z INFO Starting workers {"controller": "agent", "controllerGroup": "kagent.dev", "controllerKind": "Agent", "worker count": 1}
2025-11-26T07:07:20Z INFO Starting Controller {"controller": "remotemcpserver", "controllerGroup": "kagent.dev", "controllerKind": "RemoteMCPServer"}
2025-11-26T07:07:20Z INFO Starting workers {"controller": "remotemcpserver", "controllerGroup": "kagent.dev", "controllerKind": "RemoteMCPServer", "worker count": 1}
2025-11-26T07:07:48Z DEBUG http Request started {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:34006"}
2025-11-26T07:07:48Z DEBUG http.health-handler Handling health check request {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:34006"}
2025-11-26T07:07:48Z INFO http Request completed {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:34006", "status": 200, "duration": "117.034µs"}
2025-11-26T07:07:49Z DEBUG http Request started {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:47646"}
2025-11-26T07:07:49Z DEBUG http.health-handler Handling health check request {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:47646"}
2025-11-26T07:07:49Z INFO http Request completed {"method": "GET", "path": "/health", "remote_addr": "10.244.0.1:47646", "status": 200, "duration": "115.967µs"}
```

Post this PR (without the log level being overridden as we do when running locally - note the missing "debug" logs):

```
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"setup","msg":"Starting KAgent Controller","version":"v0.7.5-3-g44e8c6d3","git_commit":"44e8c6d3","build_date":"2025-11-26","config":{"Metrics":{"Addr":"0","CertPath":"","CertName":"tls.crt","CertKey":"tls.key"},"Streaming":{"MaxBufSize":"1Mi","InitialBufSize":"4Ki","Timeout":600000000000},"LeaderElection":false,"ProbeAddr":":8082","SecureMetrics":true,"EnableHTTP2":false,"DefaultModelConfig":{"Namespace":"kagent","Name":"default-model-config"},"HttpServerAddr":":8083","WatchNamespaces":"","A2ABaseUrl":"http://127.0.0.1:8083","Database":{"Type":"sqlite","Path":"/sqlite-volume/kagent.db","Url":"postgres://postgres:kagent@db.kagent.svc.cluster.local:5432/crud"}}}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Updating GOMAXPROCS=2: determined from CPU quota"}
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"setup","msg":"Watching all namespaces (no valid namespaces specified)"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"MCPServer CRD not found - controller will not be started","controller":"mcpserver"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Could not find CRD for tool server - API integration will be disabled","toolServerType":"MCPServer.kagent.dev"}
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"starting server","name":"health probe","addr":"[::]:8082"}
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"http-server","msg":"Starting HTTP server","address":":8083"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting EventSource","controller":"service","controllerGroup":"","controllerKind":"Service","source":"kind source: *v1.Service"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting Controller","controller":"service","controllerGroup":"","controllerKind":"Service"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting workers","controller":"service","controllerGroup":"","controllerKind":"Service","worker count":1}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting EventSource","controller":"modelconfig","controllerGroup":"kagent.dev","controllerKind":"ModelConfig","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting EventSource","controller":"agent","controllerGroup":"kagent.dev","controllerKind":"Agent","source":"kind source: *v1.Service"}
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"setup","msg":"starting pprof server"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting EventSource","controller":"agent","controllerGroup":"kagent.dev","controllerKind":"Agent","source":"kind source: *v1.Deployment"}
{"level":"info","ts":"2025-11-26T07:32:28Z","logger":"setup","msg":"pprof server started","address":":6060"}
...
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting Controller","controller":"modelconfig","controllerGroup":"kagent.dev","controllerKind":"ModelConfig"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting workers","controller":"modelconfig","controllerGroup":"kagent.dev","controllerKind":"ModelConfig","worker count":1}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting Controller","controller":"agent","controllerGroup":"kagent.dev","controllerKind":"Agent"}
{"level":"info","ts":"2025-11-26T07:32:28Z","msg":"Starting workers","controller":"agent","controllerGroup":"kagent.dev","controllerKind":"Agent","worker count":1}
{"level":"info","ts":"2025-11-26T07:32:55Z","logger":"http","msg":"Request completed","method":"GET","path":"/health","remote_addr":"10.244.0.1:34516","status":200,"duration":0.000057265}
{"level":"info","ts":"2025-11-26T07:32:55Z","logger":"http","msg":"Request completed","method":"GET","path":"/health","remote_addr":"10.244.0.1:34524","status":200,"duration":0.000050881}
```

[^1]: If anyone is opposed to JSON logs locally then we can reconfigure these locally, but this will require updates to the Helm chart that I'm trying to avoid pending the switch to using env vars instead of args in #1133. Specifically, I don't want to add `controller.args` to the values now when that could subsequently be removed (a breaking change) in the linked PR.

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Split this out of #1133 to try to reduce the size of that PR - but also because it's not strictly related to being able to scale the controller; it simply manifested when needing to switch to postgres when running multiple controller replicas.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Another artifact of #1133. No need for the sqlite volume+mount when the database is set to postgres.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
…te database (#1144)

Running multiple controller replicas when using a local SQLite database will lead to errors, as API requests will inevitably end up being handled by a replica that does not have the local state (e.g. an A2A session). This check/error hopefully prevents users from making this mistake. Split out from #1133.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
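A minimal sketch of what such a guard in `_helpers.tpl` could look like (the template name and error message here are assumptions, not the chart's actual wording):

```yaml
{{- define "kagent.validateDatabase" -}}
{{- if and (eq .Values.database.type "sqlite") (gt (int .Values.controller.replicas) 1) -}}
{{- fail "database.type=sqlite cannot be used with multiple controller replicas; use postgres for HA" -}}
{{- end -}}
{{- end -}}
```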
Enables local testing using postgres as a backing store for the controller. Split out from #1133 (with added docs).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
**Yet another PR split out from #1133 to try to reduce review burden** - keeping that one open for now as all of these other PRs are ultimately working towards that goal.

This PR refactors the kagent controller to support the use of environment variables for configuration in addition to command-line arguments. It also updates the Helm chart to make use of env vars instead of command-line args and adds the ability for users to supply their own environment variables with custom configuration. This allows users to supply sensitive configuration (e.g. the postgres database URL) via Secrets instead of exposing it via `args`. Env vars are also easier to patch when working with rendered manifests if needed.

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
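For example, a user could wire database credentials in via a Secret roughly like this (an illustrative `values.yaml` sketch; the Secret name and env var key are assumptions):

```yaml
controller:
  replicas: 2
  envFrom:
    - secretRef:
        name: kagent-postgres   # holds the database URL, e.g. as KAGENT_DATABASE_URL
database:
  type: postgres
```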
Removed A2A "reconciler" and replaced with a registrar that is responsible for registering/deregistering mux handlers. Main controller still manages cluster resources and DB/status. Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
…oller Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Closing since everything in here has now been split out and only 2 open PRs remain.
This PR enables leader election on the controller if it is configured with more than one replica, to ensure that only one replica is actively reconciling watched manifests. It also ensures that the necessary RBAC manifests are created. Final part of #1133 (excluding #1138).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
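Per the per-file summary above, the controller ConfigMap derives the leader election flag from the replica count. Rendered, that could look roughly like this (a sketch; the data key name is an assumption):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kagent-controller
data:
  # true whenever more than one replica is configured
  KAGENT_LEADER_ELECTION: {{ gt (int .Values.controller.replicas) 1 | quote }}
```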
…econcilation (#1138)

**Decided to split this out of #1133 to try to make review a little easier, as it's a chunky commit that can live in isolation from the rest of the changes in that PR**

This change separates A2A handler registration from the main `Agent` controller reconciliation loop by introducing a dedicated `A2ARegistrar` that manages the A2A routing table independently from the main controller.

Currently, A2A handler registration is tightly coupled to the `Agent` controller's reconciliation loop, which performs the following operations:

1. Reconcile Kubernetes resources (Deployment, Service, etc.)
2. Store agent metadata in the database
3. Register the A2A handler in the routing table
4. Update resource status

This coupling is problematic for a number of reasons:

1. It breaks horizontal scaling - with leader election enabled (required to prevent duplicate reconciliation), only the leader pod performs reconciliation and registers A2A handlers. When API requests hit non-leader replicas, they fail because those replicas lack the necessary handler registrations.
2. It arguably violates separation of concerns - the controller handles both cluster resource management (its core responsibility) and API routing configuration (an orthogonal concern).
3. It makes future architectural changes (e.g., splitting API and control plane) unnecessarily complex.

This PR addresses those concerns by ensuring that all controller replicas, when scaled, maintain consistent A2A routing tables, enabling transparent load balancing across replicas. A2A logic is also consolidated into a dedicated package rather than scattered across controller code, ensuring a clean separation of API and control plane such that these could be split into independent deployments without significant refactoring in future.

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
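Conceptually, the registrar hooks into the shared cache's informers, which run on every replica regardless of leader election. A rough sketch, assuming controller-runtime's cache API (type details and import paths other than `A2ARegistrar` are assumptions):

```go
package a2a

import (
	"context"

	toolscache "k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/cache"

	"github.com/kagent-dev/kagent/go/api/v1alpha1" // import path assumed
)

// A2ARegistrar keeps the A2A mux routing table in sync with Agent objects.
type A2ARegistrar struct{ /* mux and translator dependencies elided */ }

func (r *A2ARegistrar) Start(ctx context.Context, c cache.Cache) error {
	// Informers fire on every replica, so all pods keep a consistent
	// A2A routing table even when only the leader reconciles.
	informer, err := c.GetInformer(ctx, &v1alpha1.Agent{})
	if err != nil {
		return err
	}
	_, err = informer.AddEventHandler(toolscache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { r.register(obj.(*v1alpha1.Agent)) },
		UpdateFunc: func(_, obj interface{}) { r.register(obj.(*v1alpha1.Agent)) },
		DeleteFunc: func(obj interface{}) { r.remove(obj) },
	})
	return err
}

func (r *A2ARegistrar) register(agent *v1alpha1.Agent) { /* add/update mux handler */ }
func (r *A2ARegistrar) remove(obj interface{})         { /* deregister mux handler */ }
```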
This PR aims to make it possible to scale the number of replicas for the controller, enabling an HA deployment of it. The current shortcomings it addresses include:

- No way to use a `Secret` to configure a Postgres database URL

Code can be tested locally by running: