-
Notifications
You must be signed in to change notification settings - Fork 375
fix(controller): ensure live state for agents matches desired state #700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(controller): ensure live state for agents matches desired state #700
Conversation
|
One question I do have is, if we don't have this check in place, then do we even need to store the hash anymore? |
|
On this, unfortunately, there is going to be no getting around having to fully reconcile the live state of the owned objects if we want to avoid all unnecessary/redundant API calls. However, reducing unnecessary API calls could be seen as a performance improvement that can come after ensuring that the bug is fixed. Or not 😄 |
10d91c8 to
ffe1d87
Compare
|
Have now updated this to fully reconcile the objects owned by the agents controller with much inspiration from https://github.com/open-telemetry/opentelemetry-operator. FYI @EItanya Given the API version update, I figured I'd also take the opportunity to drop the I've also left the cleanup of owned objects that are no longer needed as a TODO for the sake of expediency. |
ffe1d87 to
1f13e3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the controller's reconciliation logic to ensure live state matches desired state by removing the config hash optimization that was preventing recreation of deleted resources. The main changes replace the simple patch-based approach with a more robust reconciliation mechanism using CreateOrUpdate patterns.
- Removes config hash-based optimization that prevented proper state reconciliation
- Implements comprehensive object mutation and reconciliation logic
- Adds proper error handling for immutable field changes with automatic deletion/recreation
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| go/go.mod | Adds mergo dependency for object merging functionality |
| go/controller/translator/mutate.go | New file implementing mutation functions for different Kubernetes resource types |
| go/controller/internal/reconciler/reconciler.go | Major refactor replacing patch-based updates with CreateOrUpdate pattern and removing config hash logic |
| go/controller/api/v1alpha2/zz_generated.deepcopy.go | Removes ConfigHash field from generated deepcopy methods |
| go/controller/api/v1alpha2/agent_types.go | Removes ConfigHash field from AgentStatus struct |
| go/config/crd/bases/kagent.dev_agents.yaml | Updates CRD to remove configHash field from status |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
go/controller/translator/mutate.go
Outdated
| var ( | ||
| ImmutableChangeErr *ImmutableFieldChangeErr | ||
| ) | ||
|
|
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This global variable declaration creates a nil pointer that will cause panics when used with errors.As(). This should be removed as the error type checking should be done directly with the concrete type.
| } | ||
|
|
||
| func mutateServiceAccount(existing, desired *corev1.ServiceAccount) { | ||
| // Nothing to do here for the time being - we don't really care about anything but the existence of the ServiceAccount |
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is empty but the comment suggests it's intentional. Consider adding a proper docstring to explain why no mutation is needed for ServiceAccount resources.
| // Nothing to do here for the time being - we don't really care about anything but the existence of the ServiceAccount | |
| // mutateServiceAccount does not perform any mutation because only the existence of the ServiceAccount | |
| // is required. No fields are currently managed or reconciled by this operator. | |
| func mutateServiceAccount(existing, desired *corev1.ServiceAccount) { |
| return nil, fmt.Errorf("failed to translate agent %s/%s: %v", agent.Namespace, agent.Name, err) | ||
| } | ||
|
|
||
| ownedObjects := map[types.UID]client.Object{} // TODO: We should lookup all objects that are actually owned by the controller to ensure that resources are correctly pruned over time |
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ownedObjects map is initialized as empty but never populated. This means the pruning logic in deleteObjects() will never execute, potentially leaving orphaned resources. The TODO comment indicates this is a known issue that should be addressed.
| ownedObjects := map[types.UID]client.Object{} // TODO: We should lookup all objects that are actually owned by the controller to ensure that resources are correctly pruned over time | |
| ownedObjects, err := a.getOwnedObjects(ctx, agent) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to list owned objects: %v", err) | |
| } |
| return createOrUpdateErr | ||
| }) | ||
|
|
||
| if err != nil && errors.As(err, &translator.ImmutableChangeErr) { |
Copilot
AI
Aug 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using errors.As with the global nil pointer translator.ImmutableChangeErr will always return false. This should be changed to check for the concrete error type: var immutableErr *translator.ImmutableFieldChangeErr; if errors.As(err, &immutableErr)
| if err != nil && errors.As(err, &translator.ImmutableChangeErr) { | |
| var immutableErr *translator.ImmutableFieldChangeErr | |
| if err != nil && errors.As(err, &immutableErr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not true. If the pointer were nil, then the code would panic.
EItanya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off, thanks so much for the PR, these changes look awesome.
At a high level though I'm not 100% sure why this is better than the configHash approach. I understand adding pruning, but right now our system never creates, and then later deleted the resource without the parent resource being deleted itself, and for that we use ownerReferences.
This code is a bit complex, so before we go ahead and merge this I wanna think through the end state we're looking for.
The issue with the config hash approach is that the live state of the resources that were created is only ever reconciled (which, in the current code, is patching) if:
This means that drift can occur. For example, if I manually delete the |
Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
1f13e3e to
5c4ea49
Compare
EItanya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM, tested locally and behavior seems much more stable :)
| reason = "AgentReconcileFailed" | ||
| reconcileLog.Error(inputErr, "failed to reconcile agent", "agent", utils.GetObjectRef(agent)) | ||
| message = err.Error() | ||
| reason = "ReconcileFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but we should probably define these Reasons as consts in the API types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed :)
|
|
||
| if err != nil && errors.As(err, &translator.ImmutableChangeErr) { | ||
| l.Error(err, "detected immutable field change, trying to delete, new object will be created on next reconcile", "existing", existing.GetName()) | ||
| delErr := a.kube.Delete(ctx, existing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a bit scary to delete something without informing the user. ideally we shouldn't modify immutable fields, so we have a consistent behavior.
Users might be surprised if the result of them changing a field is that pods get deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whilst I agree that the controller ideally shouldn't modify immutable fields (it doesn't at present), the time may come when this is needed. That said, I don't see deleting as scary or surprising; these resources are owned by the controller, and it should be free to do as needed to ensure that the live state matches the desired state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @yuval-k here. If we’re going to support deletes I believe we’re going to need some sort of rollout mechanism. We do own these resources, but agent downtime for an update seems like a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Will rip this bit of code out now - as mentioned, the controller doesn't try to update immutable fields at present so this is really just a safe-guard against a potential future problem. I'm more than happy to kick this can down the road and make it someone else's problem 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 87cb91a
Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
c9ee8c7 to
87cb91a
Compare
…agent-dev#700) At present, the controller will ignore any changes to the live state of resources it owns due to the check that's being removed in this PR. This means that, for example, if the deployment for an agent is deleted, it will not be recreated until: - The agent specification changes - The agent is recreated - The controller is restarted (and this is only true because the database is not persistent). ~However, removing this optimisation will increase load on the Kube API and also surfaces other issues with the controller code (e.g. controllers are currently updating objects they do not own). Opening this PR in draft for now so the issue is visible and we can discuss how best to proceed. I have also started looking into and will try to open smaller, related PRs that aim to minimise unnecessary reconciliation, thereby alleviating/diminishing the impact of removing this check.~ --------- Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
At present, the controller will ignore any changes to the live state of resources it owns due to the check that's being removed in this PR. This means that, for example, if the deployment for an agent is deleted, it will not be recreated until:
However, removing this optimisation will increase load on the Kube API and also surfaces other issues with the controller code (e.g. controllers are currently updating objects they do not own). Opening this PR in draft for now so the issue is visible and we can discuss how best to proceed. I have also started looking into and will try to open smaller, related PRs that aim to minimise unnecessary reconciliation, thereby alleviating/diminishing the impact of removing this check.