Skip to content

Conversation

@onematchfox
Copy link
Contributor

@onematchfox onematchfox commented Aug 7, 2025

At present, the controller will ignore any changes to the live state of resources it owns due to the check that's being removed in this PR. This means that, for example, if the deployment for an agent is deleted, it will not be recreated until:

  • The agent specification changes
  • The agent is recreated
  • The controller is restarted (and this is only true because the database is not persistent).

However, removing this optimisation will increase load on the Kube API and also surfaces other issues with the controller code (e.g. controllers are currently updating objects they do not own). Opening this PR in draft for now so the issue is visible and we can discuss how best to proceed. I have also started looking into and will try to open smaller, related PRs that aim to minimise unnecessary reconciliation, thereby alleviating/diminishing the impact of removing this check.

@onematchfox
Copy link
Contributor Author

One question I do have is, if we don't have this check in place, then do we even need to store the hash anymore?

@onematchfox
Copy link
Contributor Author

On this, unfortunately, there is going to be no getting around having to fully reconcile the live state of the owned objects if we want to avoid all unnecessary/redundant API calls. However, reducing unnecessary API calls could be seen as a performance improvement that can come after ensuring that the bug is fixed. Or not 😄

@onematchfox onematchfox force-pushed the fix-controller-reconcile branch 3 times, most recently from 10d91c8 to ffe1d87 Compare August 14, 2025 16:14
@onematchfox
Copy link
Contributor Author

onematchfox commented Aug 14, 2025

Have now updated this to fully reconcile the objects owned by the agents controller with much inspiration from https://github.com/open-telemetry/opentelemetry-operator. FYI @EItanya Given the API version update, I figured I'd also take the opportunity to drop the ConfigHash from agent status; although, to be honest, I'm not an expert on versioning Kubernetes APIs, so I'm not 100% sure on what the correct course of action is here. Happy to revert that out if needed.

I've also left the cleanup of owned objects that are no longer needed as a TODO for the sake of expediency.

@onematchfox onematchfox marked this pull request as ready for review August 14, 2025 16:29
Copilot AI review requested due to automatic review settings August 14, 2025 16:29
@onematchfox onematchfox force-pushed the fix-controller-reconcile branch from ffe1d87 to 1f13e3e Compare August 14, 2025 16:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the controller's reconciliation logic to ensure live state matches desired state by removing the config hash optimization that was preventing recreation of deleted resources. The main changes replace the simple patch-based approach with a more robust reconciliation mechanism using CreateOrUpdate patterns.

  • Removes config hash-based optimization that prevented proper state reconciliation
  • Implements comprehensive object mutation and reconciliation logic
  • Adds proper error handling for immutable field changes with automatic deletion/recreation

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
go/go.mod Adds mergo dependency for object merging functionality
go/controller/translator/mutate.go New file implementing mutation functions for different Kubernetes resource types
go/controller/internal/reconciler/reconciler.go Major refactor replacing patch-based updates with CreateOrUpdate pattern and removing config hash logic
go/controller/api/v1alpha2/zz_generated.deepcopy.go Removes ConfigHash field from generated deepcopy methods
go/controller/api/v1alpha2/agent_types.go Removes ConfigHash field from AgentStatus struct
go/config/crd/bases/kagent.dev_agents.yaml Updates CRD to remove configHash field from status

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

var (
ImmutableChangeErr *ImmutableFieldChangeErr
)

Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This global variable declaration creates a nil pointer that will cause panics when used with errors.As(). This should be removed as the error type checking should be done directly with the concrete type.

Suggested change

Copilot uses AI. Check for mistakes.
}

func mutateServiceAccount(existing, desired *corev1.ServiceAccount) {
// Nothing to do here for the time being - we don't really care about anything but the existence of the ServiceAccount
Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is empty but the comment suggests it's intentional. Consider adding a proper docstring to explain why no mutation is needed for ServiceAccount resources.

Suggested change
// Nothing to do here for the time being - we don't really care about anything but the existence of the ServiceAccount
// mutateServiceAccount does not perform any mutation because only the existence of the ServiceAccount
// is required. No fields are currently managed or reconciled by this operator.
func mutateServiceAccount(existing, desired *corev1.ServiceAccount) {

Copilot uses AI. Check for mistakes.
return nil, fmt.Errorf("failed to translate agent %s/%s: %v", agent.Namespace, agent.Name, err)
}

ownedObjects := map[types.UID]client.Object{} // TODO: We should lookup all objects that are actually owned by the controller to ensure that resources are correctly pruned over time
Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ownedObjects map is initialized as empty but never populated. This means the pruning logic in deleteObjects() will never execute, potentially leaving orphaned resources. The TODO comment indicates this is a known issue that should be addressed.

Suggested change
ownedObjects := map[types.UID]client.Object{} // TODO: We should lookup all objects that are actually owned by the controller to ensure that resources are correctly pruned over time
ownedObjects, err := a.getOwnedObjects(ctx, agent)
if err != nil {
return nil, fmt.Errorf("failed to list owned objects: %v", err)
}

Copilot uses AI. Check for mistakes.
return createOrUpdateErr
})

if err != nil && errors.As(err, &translator.ImmutableChangeErr) {
Copy link

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using errors.As with the global nil pointer translator.ImmutableChangeErr will always return false. This should be changed to check for the concrete error type: var immutableErr *translator.ImmutableFieldChangeErr; if errors.As(err, &immutableErr)

Suggested change
if err != nil && errors.As(err, &translator.ImmutableChangeErr) {
var immutableErr *translator.ImmutableFieldChangeErr
if err != nil && errors.As(err, &immutableErr) {

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not true. If the pointer were nil, then the code would panic.

Copy link
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, thanks so much for the PR, these changes look awesome.

At a high level though I'm not 100% sure why this is better than the configHash approach. I understand adding pruning, but right now our system never creates, and then later deleted the resource without the parent resource being deleted itself, and for that we use ownerReferences.

This code is a bit complex, so before we go ahead and merge this I wanna think through the end state we're looking for.

@onematchfox
Copy link
Contributor Author

onematchfox commented Aug 14, 2025

At a high level though I'm not 100% sure why this is better than the configHash approach.

The issue with the config hash approach is that the live state of the resources that were created is only ever reconciled (which, in the current code, is patching) if:

  • The agent specification changes (the ConfigHash changes)
  • The agent is recreated (a new ConfigHash is created)
  • The controller is restarted (essentially the same as a new Agent and only because the database is not persistent)

This means that drift can occur. For example, if I manually delete the Deployment for an Agent, it will not be re-created until one of the above conditions is met. The same holds for updates and across all owned resources. Hopefully that makes more sense?

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
@onematchfox onematchfox force-pushed the fix-controller-reconcile branch from 1f13e3e to 5c4ea49 Compare August 26, 2025 09:06
@onematchfox onematchfox requested a review from yuval-k as a code owner August 26, 2025 09:06
EItanya
EItanya previously approved these changes Aug 26, 2025
Copy link
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, tested locally and behavior seems much more stable :)

reason = "AgentReconcileFailed"
reconcileLog.Error(inputErr, "failed to reconcile agent", "agent", utils.GetObjectRef(agent))
message = err.Error()
reason = "ReconcileFailed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but we should probably define these Reasons as consts in the API types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed :)


if err != nil && errors.As(err, &translator.ImmutableChangeErr) {
l.Error(err, "detected immutable field change, trying to delete, new object will be created on next reconcile", "existing", existing.GetName())
delErr := a.kube.Delete(ctx, existing)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit scary to delete something without informing the user. ideally we shouldn't modify immutable fields, so we have a consistent behavior.

Users might be surprised if the result of them changing a field is that pods get deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst I agree that the controller ideally shouldn't modify immutable fields (it doesn't at present), the time may come when this is needed. That said, I don't see deleting as scary or surprising; these resources are owned by the controller, and it should be free to do as needed to ensure that the live state matches the desired state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @yuval-k here. If we’re going to support deletes I believe we’re going to need some sort of rollout mechanism. We do own these resources, but agent downtime for an update seems like a bad idea.

Copy link
Contributor Author

@onematchfox onematchfox Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Will rip this bit of code out now - as mentioned, the controller doesn't try to update immutable fields at present so this is really just a safe-guard against a potential future problem. I'm more than happy to kick this can down the road and make it someone else's problem 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 87cb91a

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
@onematchfox onematchfox force-pushed the fix-controller-reconcile branch from c9ee8c7 to 87cb91a Compare August 26, 2025 15:05
@EItanya EItanya merged commit dbab930 into kagent-dev:main Aug 26, 2025
13 of 14 checks passed
yanivmn pushed a commit to yanivmn/kagent that referenced this pull request Aug 28, 2025
…agent-dev#700)

At present, the controller will ignore any changes to the live state of
resources it owns due to the check that's being removed in this PR. This
means that, for example, if the deployment for an agent is deleted, it
will not be recreated until:
- The agent specification changes
- The agent is recreated
- The controller is restarted (and this is only true because the
database is not persistent).

~However, removing this optimisation will increase load on the Kube API
and also surfaces other issues with the controller code (e.g.
controllers are currently updating objects they do not own). Opening
this PR in draft for now so the issue is visible and we can discuss how
best to proceed. I have also started looking into and will try to open
smaller, related PRs that aim to minimise unnecessary reconciliation,
thereby alleviating/diminishing the impact of removing this check.~

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
@onematchfox onematchfox deleted the fix-controller-reconcile branch September 5, 2025 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants