
feat: introduce state for generator and new grafana SA generator#3982

Closed
moolen wants to merge 1 commit intomainfrom
mj-pushsecret-generator-state

Conversation

@moolen
Member

@moolen moolen commented Oct 5, 2024

Stateful Generator Proposal

This PR is a proposal to introduce state for generators, alongside an implementation of a new stateful generator for managing Grafana service accounts.

📜 You can find the design proposal here

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 2 times, most recently from da3b860 to 8cb33ba Compare October 5, 2024 21:15
@gusfcarvalho
Member

Nice! Three questions:

  • I take it that for ExternalSecrets, generatorState will be a map[dataFrom][generatorState]?
  • How would it look for e.g. the Password generator?
  • At which scope is the grace period configured? Controller, Generator, PushSecret/ExternalSecret?

@moolen
Member Author

moolen commented Oct 9, 2024

How would it look for e.g. the Password generator?

It has no state, so there is nothing to store and nothing to clean up.

Is the grace period a configuration at which scope? Controller, Generator, PushSecret/ExternalSecret?

At the moment it's just a flag on the controller, hence static across all PushSecrets and ExternalSecrets. That's the simplest and most straightforward option; we could even consider making it part of the .Generate() API so that it additionally returns the grace period.

I take it that for ExternalSecrets, generatorState will be a map[dataFrom][generatorState]?

Not sure yet if that is even possible. I think it's quite a pain to make this work with ExternalSecret.
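For illustration only, a hedged sketch of the shape being asked about above; none of these type or field names exist in ESO:

```go
package main

import "fmt"

// GeneratorProviderState stands in for the opaque per-provider state blob
// (a hypothetical simplification; the real type is a raw JSON extension).
type GeneratorProviderState string

// generatorStatus is a hypothetical shape for "state per dataFrom entry":
// keyed by an identifier of the dataFrom[] entry that produced the state.
type generatorStatus struct {
	stateByDataFrom map[string]GeneratorProviderState
}

func main() {
	st := generatorStatus{stateByDataFrom: map[string]GeneratorProviderState{
		"dataFrom/0": `{"serviceAccountID":42}`,
	}}
	fmt.Println(st.stateByDataFrom["dataFrom/0"]) // prints {"serviceAccountID":42}
}
```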

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 8 times, most recently from 0a8b3c0 to ed9a1a4 Compare October 16, 2024 23:08
@moolen moolen force-pushed the mj-pushsecret-generator-state branch 5 times, most recently from 101d94c to f1a6963 Compare October 25, 2024 20:36
@moolen moolen marked this pull request as ready for review October 26, 2024 23:01
@moolen moolen requested a review from a team as a code owner October 26, 2024 23:01
@moolen moolen requested a review from Skarlso October 26, 2024 23:01
@Skarlso Skarlso self-assigned this Oct 27, 2024
@Skarlso
Contributor

Skarlso commented Oct 28, 2024

What happens if the Status of an object is changed? The status field is an elusive thing, and using it as a store is problematic at times. The status is regarded as something that we should be able to regenerate to the same value on each reconcile attempt.

For example, having an endpoint in there that other controllers can then use to reach something, a SHA that doesn't change, a condition, etc. Thus, when it's missing, it can simply be regenerated. Status is supposed to be a view of the current system from Kubernetes' point of view.

In this case, however, the status stores state for something that comes from an external system. If the controller runs again and that state is missing, it will produce a new resource and could miss existing ones, can it not?

@Skarlso
Contributor

Skarlso commented Oct 28, 2024

So what I'm thinking about is the following scenario:

  • secret is generated initially
  • state is stored
  • secret is rotated
  • status is updated to the latest state and GC runs
  • the status is cleared, the object is updated, and at the same time the secret is rotated
  • what's the state of "latest" and GC now?

@Skarlso
Contributor

Skarlso commented Oct 28, 2024

What I'm trying to say is that we are basically turning the status field into a key-value store. The grace period in particular is flimsy, since the status can disappear or be regenerated, and then all of the stored config is lost. We are also increasing YAML serialization cost and size for each CR when we store the spec in that field.

Wouldn't it be better if we would use annotations or a specific CR for this? What if, instead of serializing it into the status field, we create a separate CR saying generator tracking or something and store that information in the spec of that CR?

If you think it's okay that the field might be lost and this is temporary data at best, I guess it's fine. :) Just keep these things in mind and don't rely on the field being there all the time. :)

@moolen
Member Author

moolen commented Nov 16, 2024

What I'm trying to say is that we are basically turning the status field into a key-value store. Basically graceperiod is flimsy since the status can disappear or can be regenerated and then all the stored config is lost. We are also increasing yaml serialization and size for each CR when we store the spec in the field.

Wouldn't it be better if we would use annotations or a specific CR for this? What if, instead of serializing it into the status field, we create a separate CR saying generator tracking or something and store that information in the spec of that CR?

If you think it's okay that the field might be lost and this is temporary data at best, I guess it's fine. :) Just keep these things in mind and don't rely on the field being there all the time. :)

Thank you for your input! You're right with all your concerns! The problem is that we cannot rebuild the generator state: not all providers allow us to store ownership metadata on the resources we generate, e.g. Grafana service account tokens. For some providers it could work, e.g. AWS IAM keys with tags or GCP service accounts with labels/annotations.

I think it's OK if the status gets lost in some corner cases. Users don't typically fumble around with statuses (other than reading from them), so I don't expect the status to suddenly vanish.

I would add that behaviour to the docs: ESO may not be able to 🧹 clean up generated contents in the following cases:

  1. the finalizer gets removed by the user, not by ESO
  2. resources which contain GC data get force-deleted
  3. the GC data in the status subresource gets tampered with
  4. a race condition where ESO has generated contents but dies before it is able to store the GC data in the status
  5. authentication with the provider is not possible at cleanup time; the finalizer will then never be removed and it is the responsibility of the user to clean up the generated contents. Authentication can fail if e.g. a secret referenced in the generator spec gets deleted, is invalid, or becomes inaccessible due to RBAC changes.

I would categorize 1, 2, 3 and 5 as user errors. 4 is a risk we cannot control.

I think dedicated resources for generator state would help with some of these limitations, but I think this comes at a cost: more resources and complexity, more documentation and code, and users need to be trained to be aware of them, etc.
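For comparison, a purely hypothetical sketch of what such a dedicated resource could look like; neither the GeneratorState kind nor any of its fields exist in ESO at this point:

```yaml
# Purely hypothetical: neither this kind nor these fields exist in ESO.
apiVersion: generators.external-secrets.io/v1alpha1
kind: GeneratorState
metadata:
  name: stateful-generator-0
spec:
  garbageCollectionDeadline: "2024-12-01T00:00:00Z"
  resource:
    # the generator that produced the state
    apiVersion: generators.external-secrets.io/v1alpha1
    kind: Grafana
    name: test
  state:
    # opaque provider state, e.g. the Grafana service account ID
    serviceAccountID: 42
```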

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 3 times, most recently from 027f30c to 1ea5d1b Compare November 16, 2024 17:15
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 1ea5d1b to 36dd28b Compare November 16, 2024 17:33
@Skarlso
Contributor

Skarlso commented Nov 23, 2024

I'm going to find some time to do this this weekend. 😊

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Ugh, so sorry for the conflict. :/

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Actually those came from the controller API thingy...

Contributor

@Skarlso Skarlso left a comment


Initial run looks good. I have to do some manual testing. Do you have a flow, @moolen?

@moolen
Member Author

moolen commented Nov 26, 2024

I used:

kubectl create secret generic grafana-token --from-literal=token=the-token

where the token is one with basically admin capabilities (it can create other service accounts :D).
I created a free Grafana subscription for this testing.

apiVersion: generators.external-secrets.io/v1alpha1
kind: Grafana
metadata:
  name: test
spec:
  url: "moolen.grafana.net"
  auth:
    token:
      name: "grafana-token"
      key: "token"
  serviceAccount:
    name: "foo"
    role: "Editor"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stateful-generator
spec:
  refreshInterval: 1m
  dataFrom:
    - sourceRef:
        generatorRef:
          apiVersion: generators.external-secrets.io/v1alpha1
          kind: Grafana
          name: test


Once applied and reconciling, you should be able to watch the service accounts come and go (with ever-increasing IDs).

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Thanks!

Contributor

@Skarlso Skarlso left a comment


I had a deeper look and have a few questions. :)

Comment on lines +102 to +111
func (m *Manager) MoveStateToGC(resource *apiextensions.JSON, stateKey string, gen v1alpha1.Generator, state v1alpha1.GeneratorProviderState) {
	m.internalState = append(m.internalState, QueueItem{
		Commit: func() error {
			return m.ImmediateMoveStateToGC(resource, stateKey, gen, state)
		},
	})
}

// ImmediateMoveStateToGC will move the generator state to GC immediately.
func (m *Manager) ImmediateMoveStateToGC(resource *apiextensions.JSON, stateKey string, gen v1alpha1.Generator, state v1alpha1.GeneratorProviderState) error {
Contributor


These two are a bit confusing, API- and usage-wise. The first one only enqueues a Commit function, while the second one performs the move right away; just looking at the API, I wouldn't have guessed that.

Move doesn't actually do anything until the Commit is invoked. I would call it a setup, or a "set commit", or something like that, to make it apparent that something else has to happen afterwards.

WDYT?
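A minimal sketch of the enqueue/commit pattern under discussion, using the more explicit naming suggested here (all names are hypothetical and heavily simplified from the real Manager):

```go
package main

import "fmt"

// Manager queues GC moves and performs them later; a hypothetical,
// heavily simplified stand-in for the Manager in this PR.
type Manager struct {
	queue []func() error
}

// EnqueueMoveStateToGC only records the intent to move state to GC.
// The explicit "Enqueue" prefix makes clear that nothing happens yet.
func (m *Manager) EnqueueMoveStateToGC(stateKey string) {
	m.queue = append(m.queue, func() error {
		fmt.Printf("moving %s to GC\n", stateKey)
		return nil
	})
}

// CommitAll performs every queued operation and clears the queue.
func (m *Manager) CommitAll() error {
	for _, op := range m.queue {
		if err := op(); err != nil {
			return err
		}
	}
	m.queue = nil
	return nil
}

func main() {
	m := &Manager{}
	m.EnqueueMoveStateToGC("token-a")
	m.EnqueueMoveStateToGC("token-b")
	_ = m.CommitAll() // prints "moving token-a to GC" then "moving token-b to GC"
}
```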

Member Author


I agree, that's bad design/naming. Let me rethink that.

@moolen
Member Author

moolen commented Nov 29, 2024

Meh, found a major issue after rebasing: the status update mechanics have changed and status updates get dropped. I'll need to dig into this 🐇 🕳️

@Skarlso
Contributor

Skarlso commented Nov 29, 2024

@moolen I think most of the time in that PR markAsFailed is not used. The reason behind that was that it would mess with the reconcile time, I believe. Maybe that was an error?!

@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 36dd28b to 62dea0e Compare December 3, 2024 00:07
@moolen moolen marked this pull request as draft December 3, 2024 00:17
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 62dea0e to a739884 Compare December 3, 2024 00:34
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from a739884 to a372c4e Compare December 3, 2024 00:41
@sonarqubecloud

sonarqubecloud bot commented Dec 3, 2024

@moolen
Member Author

moolen commented Dec 5, 2024

WRT #4086

I have removed the changes from #3608 which were causing needless GET requests on the status of ExternalSecrets.
I could not reproduce the race condition after my updates, but I also fixed the underlying issue, and we now abandon the reconcile if the existing secret has the spec.immutable field set to true.

Let me illustrate the important bits from ES's Reconcile() func:

	externalSecret := &esv1beta1.ExternalSecret{}
	err := r.Get(ctx, req.NamespacedName, externalSecret)

	// the PR removed this bit right here, which seems superfluous, but see below
	// for a specific case where this might cause issues.
	err = r.SubResource("status").Get(ctx, externalSecret, externalSecret)
	if err != nil {
		log.Error(err, "failed to get status subresource")
		return ctrl.Result{}, err
	}
	// ...things
	defer func() {
		// the status update goes straight to kube-apiserver and does not update the cache.
		// if we change the status AND requeue immediately (see below), then we requeue with the old status.
		updateErr := r.Status().Update(ctx, externalSecret)
		if updateErr != nil {
			log.Error(updateErr, "failed to update status")
		}
	}()
	// in some conditions we do requeue immediately. This should happen rarely, but it can.
	return ctrl.Result{Requeue: true}, nil

By (Kubernetes) design, a controller should be able to reconstruct the status. In our case we store stateful information (syncedResourceVersion and refreshTime) which cannot be reconstructed.

With this PR introducing another stateful field to the status subresource, this bug becomes more severe, triggering ~30-50% of the time due to many changes in status.

I think we shouldn't reintroduce the r.SubResource("status").Get( ... ) bit; it should be fine to just reconcile twice, unless there are bugs around lifecycle guarantees, e.g. where ESO should reconcile exactly once to create a secret (refreshInterval=0). Worst case, we would have to requeue with a delay to ensure the status update has propagated into the controller cache.


With that being said, I've decided to rewrite the state handling using a custom resource, as it seems less error-prone and follows the Kubernetes design principles more closely.

@moolen
Member Author

moolen commented Dec 5, 2024

I'm gonna close this PR and open a new one with the new implementation ⌛
