feat: Introduce kube-proxy replacement configuration and related functionality#5
feat: Introduce kube-proxy replacement configuration and related functionality#5berkayoz wants to merge 23 commits into
Conversation
1582ca4 to
459e567
Compare
e66c896 to
56ccc7c
Compare
982264f to
4f75852
Compare
- Added NodeConfig struct to encapsulate Kubelet and Network configurations. - Implemented ToConfigMap and FromConfigMap methods for NodeConfig to handle serialization and deserialization. - Updated tests to cover new NodeConfig functionality, including signing and verification. - Modified existing Kubelet configuration handling to support the new NodeConfig structure. - Enhanced validation logic to ensure kube-proxy-free cannot be set to false when the network is enabled. - Updated cluster configuration defaults to include kube-proxy-free setting.
Signed-off-by: Homayoon (Hue) Alimohammadi <homayoon.alimohammadi@canonical.com>
4f75852 to
c2fa6c4
Compare
c2fa6c4 to
6395706
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces a kube-proxy replacement configuration knob and migrates service management toward a persisted per-node “local state” file, enabling controllers/hooks to start/stop (notably kube-proxy) based on reconciled configuration.
Changes:
- Add
network.kube-proxy-enabledto cluster config (defaults/merge/validate/CLI/API) and propagate it to nodes via a signed ConfigMap. - Introduce
local-state.yamlto persist enabled/disabled snap services and use it to start/restart/stop services from hooks/controllers. - Update Cilium network install values to enable kube-proxy replacement mode and adjust health-check settings.
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/snap/util/services.go | Removed legacy fixed service lists and wrappers. |
| pkg/snap/util/services_test.go | Removed tests for legacy service wrappers. |
| pkg/snap/util/local_state.go | Added local state model + read/write + start/restart/stop helpers. |
| pkg/snap/util/local_state_test.go | Added unit tests for local state behavior. |
| pkg/snap/snap.go | Added LocalStatePath() implementation. |
| pkg/snap/mock/mock.go | Added mock support for LocalStatePath(). |
| pkg/snap/interface.go | Extended Snap interface with LocalStatePath(). |
| pkg/k8sd/types/cluster_config_validate.go | Added validation around kube-proxy-enabled vs network enabled. |
| pkg/k8sd/types/cluster_config_validate_test.go | Added tests for the new validation rule. |
| pkg/k8sd/types/cluster_config_node.go | Added ClusterConfig↔ConfigMap conversion with signing and kube-proxy-enabled support. |
| pkg/k8sd/types/cluster_config_node_test.go | Added tests for configmap conversion/signing/round-trips. |
| pkg/k8sd/types/cluster_config_network.go | Added KubeProxyEnabled field + accessor. |
| pkg/k8sd/types/cluster_config_merge.go | Enabled merging of Network.KubeProxyEnabled. |
| pkg/k8sd/types/cluster_config_merge_test.go | Updated merge tests to include kube-proxy-enabled scenarios. |
| pkg/k8sd/types/cluster_config_kubelet.go | Removed kubelet-only configmap/signing logic (moved to node config). |
| pkg/k8sd/types/cluster_config_kubelet_test.go | Removed tests for kubelet-only configmap/signing. |
| pkg/k8sd/types/cluster_config_defaults.go | Added defaulting for Network.KubeProxyEnabled. |
| pkg/k8sd/types/cluster_config_defaults_test.go | Updated defaults tests for kube-proxy-enabled default behavior. |
| pkg/k8sd/types/cluster_config_convert.go | Wired kube-proxy-enabled through user-facing conversions. |
| pkg/k8sd/types/cluster_config_convert_test.go | Updated conversion tests for kube-proxy-enabled. |
| pkg/k8sd/features/cilium/network.go | Enabled kube-proxy replacement in Cilium values and triggers a rollout restart. |
| pkg/k8sd/controllers/update_node_configuration.go | Switched to ClusterConfig→ConfigMap for node config propagation. |
| pkg/k8sd/controllers/update_node_configuration_test.go | Updated tests for new node configmap conversion. |
| pkg/k8sd/controllers/service_args.go | Service list selection now derived from local state. |
| pkg/k8sd/controllers/node_configuration.go | Node controller now handles kube-proxy enable/disable + persists it to local state. |
| pkg/k8sd/controllers/node_configuration_test.go | Added kube-proxy-enabled reconciliation tests + local state setup. |
| pkg/k8sd/app/hooks_start.go | Start hook now prefers local state and performs legacy migration to local state. |
| pkg/k8sd/app/hooks_remove.go | Remove hook uses new StopAllK8sServices and deletes local state. |
| pkg/k8sd/app/hooks_node_ready.go | Triggers update-node-config reconciliation on node-ready. |
| pkg/k8sd/app/hooks_join.go | Writes local state on join and starts services via local state. |
| pkg/k8sd/app/hooks_bootstrap.go | Writes local state on bootstrap and starts services via local state. |
| pkg/k8sd/app/cluster_util.go | Removed legacy startControlPlaneServices helper. |
| pkg/k8sd/api/worker.go | Worker join info now includes kube-proxy-enabled. |
| pkg/k8sd/api/certificates_update.go | Restart control-plane services via local state. |
| pkg/k8sd/api/certificates_refresh.go | Restart control-plane services via local state. |
| go.mod | Bumped github.com/canonical/k8s-snap-api/v2 to v2.1.0. |
| go.sum | Updated sums for k8s-snap-api/v2 v2.1.0. |
| cmd/k8s/testdata/bootstrap-config-full.yaml | Added kube-proxy-enabled to example bootstrap config. |
| cmd/k8s/k8s_set.go | Allowed setting network.kube-proxy-enabled. |
| cmd/k8s/k8s_set_test.go | Added mapstructure tests for network.kube-proxy-enabled. |
| cmd/k8s/k8s_get.go | Added get output for network.kube-proxy-enabled. |
| cmd/k8s/k8s_enable.go | Network enable now sets kube-proxy-enabled default false. |
| cmd/k8s/k8s_bootstrap.go | Bootstrap defaults include kube-proxy-enabled false when enabling network. |
| cmd/k8s/k8s_bootstrap_test.go | Updated bootstrap CLI tests for kube-proxy-enabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Network fields | ||
| v := config.Network.KubeProxyEnabled | ||
| if v == nil { | ||
| data["kube-proxy-enabled"] = fmt.Sprintf("%t", true) | ||
| } else { | ||
| data["kube-proxy-enabled"] = fmt.Sprintf("%t", *v) | ||
| } |
| // Parse Network fields | ||
| if v, ok := m["kube-proxy-enabled"]; ok { | ||
| kubeProxyEnabled := v == "true" | ||
| config.Network.KubeProxyEnabled = &kubeProxyEnabled | ||
| } else { | ||
| // Default to true if not specified | ||
| kubeProxyEnabled := true | ||
| config.Network.KubeProxyEnabled = &kubeProxyEnabled | ||
| } |
berkayoz
left a comment
There was a problem hiding this comment.
Thanks for pushing this over the finish line, a few comments
| if config.Network.KubeProxyEnabled == nil { | ||
| // if kube-proxy-enabled is nil do not show the nil value. | ||
| kubeProxyEnabled := config.Network.GetKubeProxyEnabled() | ||
| config.Network.KubeProxyEnabled = &kubeProxyEnabled | ||
| } |
There was a problem hiding this comment.
Is this actually needed, don't we already ensure config.Network.GetKubeProxyEnabled() doesn't return nil? https://github.com/canonical/k8s-snap-api/blob/777922c7fa9b77c2936f3fddea530499888da72c/api/type_cluster_config.go#L153-L155
Below on the output assignment this should be fine already no?
There was a problem hiding this comment.
The config as returned in the response can have nil at the KubeProxyEnabled filed (we have removed the omit empty in the k8s-snap-api). If the user asks for the entire config with k8s get or just the network section with k8s get network the config is returned as is. We do not call the config.Network.GetKubeProxyEnabled() anywhere below so the nil value is shown to the end user. We do not want that. We want to make sure that if there is a nil value it is shown as "true".
| log.Info("Have kube-proxy start and marked as active") | ||
| if err := snap.StartServices(ctx, []string{"kube-proxy"}); err != nil { | ||
| log.Error(err, "failed to start kube-proxy service") | ||
| } |
There was a problem hiding this comment.
A removed node should return to the same state as what we have after snap install right? Why would we need to re-enable kube-proxy if its not setup yet, we will be missing the files and everything no?
There was a problem hiding this comment.
There is this test test_skip_services_stop_on_remove in the k8s-snap that expects kube-proxy to be enabled. I removed this part of the code from here and will handle the specifics of the test in the k8s-snap.
| localState, err := snaputil.ReadLocalState(c.snap) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to determine node type: %w", err) | ||
| // Local state file doesn't exist, use old behavior |
There was a problem hiding this comment.
How do we use the old behavior here?
There was a problem hiding this comment.
You are right. There is no way to fallback to the old behavior. The respective functions have been removed. I revised the code so if the local state file is not present it fails with a message essentially waiting for the state file to be created.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
TL;DR: solid for fresh clusters, but upgrade/mixed-version paths land in the kube-proxy + KPR conflict state, every k8sd restart rollout-restarts Cilium cluster-wide, and the local-state not-exist check is dead code.
This review was generated by Claude 5 Fable (xhigh) after consuming around 80k tokens. The quality however was SO low in the end that I'd rather resolve everything. It had all the context it needed and the prompt was fairly comprehensive, trust me.
HomayoonAlimohammadi
left a comment
There was a problem hiding this comment.
Looks great, thanks a lot to both of you for this PR!
However, I'd lean towards leaving the local state out for now, until we discuss it further. I tried to explain why I think we're better off leaving it out, but TLDR: What we're storing there is not really "local" state (it's cluster wide); Compared to what this PR should entail that's a significant change (a lot of the changed lines are related to local state); I believe it needs further discussion and polishing.
If the team's decision is to keep the current local state impl, I'm all for it, but I'd personally vote against it for now. Not trying to block change behind annoying specs and buy trouble, but I think this at least deserves a short engineering discussion.
|
|
||
| restartFn := func(ctx context.Context) error { | ||
| if err := snaputil.RestartControlPlaneServices(ctx, snap); err != nil { | ||
| localState, err := snaputil.ReadLocalState(snap) |
There was a problem hiding this comment.
why do we need this?
There was a problem hiding this comment.
now I understand. we want to keep track of whether or not kubeproxy is disabled.
instead of introducing this "local state" which to me looks like a significant addition, and might require a separate thread (discussion, alternative solutions), can't we just pass a kubeProxyEnabled bool param to the StartControlPlaneServices function?
There was a problem hiding this comment.
I see how this local state can be utilized. I see the value in it, but I'm not convinced that this is the correct way to go.
to me, what we're storing in the local state file, is not REALLY a LOCAL state. because it applies to all the nodes in the cluster basically. so we might as well store it in the cluster config and include some other stuff in that file that are actually local.
the point is that this is a thing that we need to think about and discuss a bit more. I'd lean towards leaving it out and working around it for now.
This will also make this PR leaner, more focused and prevents scope creep.
No description provided.