
feat: introduce state for generator and new grafana SA generator#3982

Closed
moolen wants to merge 1 commit intomainfrom
mj-pushsecret-generator-state

Conversation

@moolen
Member

@moolen moolen commented Oct 5, 2024

Stateful Generator Proposal

This PR is a proposal to introduce state for generators, alongside an implementation of a new stateful generator for managing Grafana service accounts.

📜 You can find the design proposal here

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 2 times, most recently from da3b860 to 8cb33ba Compare October 5, 2024 21:15
@gusfcarvalho
Member

Nice! Three questions:

  • I take it that for ExternalSecrets, generatorState will be a map[dataFrom][generatorState]?
  • How would it look for e.g. the Password generator?
  • At which scope is the grace period configured? Controller, Generator, PushSecret/ExternalSecret?

@moolen
Member Author

moolen commented Oct 9, 2024

How would it look for e.g. the Password generator?

It has no state, so there is nothing to store and nothing to clean up.

Is the grace period a configuration at which scope? Controller, Generator, PushSecret/ExternalSecret?

At the moment it's just a flag on the controller, hence static across all PushSecrets and ExternalSecrets. That's the simplest and most straightforward option; we could even consider making it part of the .Generate() API so that it additionally returns the grace period.

I take it that for ExternalSecrets, generatorState will be a map[dataFrom][generatorState]?

Not sure yet if that is even possible. I think it's quite a pain to make this work with ExternalSecret.
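For illustration only, a hedged sketch of the shape being asked about above; none of these type or field names exist in ESO:

```go
package main

import "fmt"

// GeneratorProviderState stands in for the opaque per-provider state blob
// (a hypothetical simplification; the real type is a raw JSON extension).
type GeneratorProviderState string

// generatorStatus is a hypothetical shape for "state per dataFrom entry":
// keyed by an identifier of the dataFrom[] entry that produced the state.
type generatorStatus struct {
	stateByDataFrom map[string]GeneratorProviderState
}

func main() {
	st := generatorStatus{stateByDataFrom: map[string]GeneratorProviderState{
		"dataFrom/0": `{"serviceAccountID":42}`,
	}}
	fmt.Println(st.stateByDataFrom["dataFrom/0"]) // prints {"serviceAccountID":42}
}
```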

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 8 times, most recently from 0a8b3c0 to ed9a1a4 Compare October 16, 2024 23:08
@moolen moolen force-pushed the mj-pushsecret-generator-state branch 5 times, most recently from 101d94c to f1a6963 Compare October 25, 2024 20:36
@moolen moolen marked this pull request as ready for review October 26, 2024 23:01
@moolen moolen requested a review from a team as a code owner October 26, 2024 23:01
@moolen moolen requested a review from Skarlso October 26, 2024 23:01
@Skarlso Skarlso self-assigned this Oct 27, 2024
@Skarlso
Contributor

Skarlso commented Oct 28, 2024

What happens if the Status of an object is changed? The status field is an elusive thing, and using it as a store is problematic at times. The status is regarded as something that we should be able to regenerate to the same value on each reconcile attempt.

For example, having an endpoint in there that other controllers can then use to reach something, a SHA that doesn't change, a condition, etc. Thus, when it's missing, it can simply be regenerated. Status is supposed to be a view of the current system from Kubernetes' point of view.

In this case, however, the status stores state for something that comes from an external system. If the controller runs again and that state is missing, it will produce a new resource and could miss existing ones, can it not?

@Skarlso
Contributor

Skarlso commented Oct 28, 2024

So what I'm thinking about is the following scenario:

  • secret is generated initially
  • state is stored
  • secret is rotated
  • status is updated to the latest state and GC runs
  • the status is cleared, the object is updated, and at the same time the secret is rotated
  • what's the state of "latest" and GC now?

@Skarlso
Contributor

Skarlso commented Oct 28, 2024

What I'm trying to say is that we are basically turning the status field into a key-value store. The grace period in particular is flimsy, since the status can disappear or be regenerated, and then all of the stored config is lost. We are also increasing YAML serialization cost and size for each CR when we store the spec in that field.

Wouldn't it be better if we would use annotations or a specific CR for this? What if, instead of serializing it into the status field, we create a separate CR saying generator tracking or something and store that information in the spec of that CR?

If you think it's okay that the field might be lost and this is temporary data at best, I guess it's fine. :) Just keep these things in mind and don't rely on the field being there all the time. :)

@moolen
Member Author

moolen commented Nov 16, 2024

What I'm trying to say is that we are basically turning the status field into a key-value store. Basically graceperiod is flimsy since the status can disappear or can be regenerated and then all the stored config is lost. We are also increasing yaml serialization and size for each CR when we store the spec in the field.

Wouldn't it be better if we would use annotations or a specific CR for this? What if, instead of serializing it into the status field, we create a separate CR saying generator tracking or something and store that information in the spec of that CR?

If you think it's okay that the field might be lost and this is temporary data at best, I guess it's fine. :) Just keep these things in mind and don't rely on the field being there all the time. :)

Thank you for your input! You're right with all your concerns! The problem is that we cannot rebuild the generator state: not all providers allow us to store ownership metadata on the resources we generate, e.g. Grafana service account tokens. For some providers it could work, e.g. AWS IAM keys with tags or GCP service accounts with labels/annotations.

I think it's OK if the status gets lost in some corner cases. Users don't typically fumble around with statuses (other than reading from them), so I don't expect the status to suddenly vanish.

I would add that behaviour to the docs: ESO may not be able to 🧹 clean up generated contents in the following cases:

  1. the finalizer gets removed by the user, not by ESO
  2. resources which contain GC data get force-deleted
  3. the GC data in the status subresource gets tampered with
  4. a race condition where ESO has generated contents but dies before it is able to store the GC data in the status
  5. authentication with the provider is not possible at cleanup time; the finalizer will then never be removed and it is the responsibility of the user to clean up the generated contents. Authentication can fail if e.g. a secret referenced in the generator spec gets deleted, is invalid, or becomes inaccessible due to RBAC changes.

I would categorize 1, 2, 3 and 5 as user errors. 4 is a risk we cannot control.

I think dedicated resources for generator state would help with some of these limitations, but I think this comes at a cost: more resources and complexity, more documentation and code, and users need to be trained to be aware of them, etc.
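For comparison, a purely hypothetical sketch of what such a dedicated resource could look like; neither the GeneratorState kind nor any of its fields exist in ESO at this point:

```yaml
# Purely hypothetical: neither this kind nor these fields exist in ESO.
apiVersion: generators.external-secrets.io/v1alpha1
kind: GeneratorState
metadata:
  name: stateful-generator-0
spec:
  garbageCollectionDeadline: "2024-12-01T00:00:00Z"
  resource:
    # the generator that produced the state
    apiVersion: generators.external-secrets.io/v1alpha1
    kind: Grafana
    name: test
  state:
    # opaque provider state, e.g. the Grafana service account ID
    serviceAccountID: 42
```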

@moolen moolen force-pushed the mj-pushsecret-generator-state branch 3 times, most recently from 027f30c to 1ea5d1b Compare November 16, 2024 17:15
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 1ea5d1b to 36dd28b Compare November 16, 2024 17:33
@Skarlso
Contributor

Skarlso commented Nov 23, 2024

I'm going to find some time to do this this weekend. 😊

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Ugh, so sorry for the conflict. :/

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Actually those came from the controller API thingy...

Contributor

@Skarlso Skarlso left a comment


Initial run looks good. I have to do some manual testing. Do you have a flow, @moolen?

@moolen
Member Author

moolen commented Nov 26, 2024

I used:

kubectl create secret generic grafana-token --from-literal=token=the-token

where the token is one with basically admin capabilities (it can create other service accounts :D).
I created a free Grafana subscription for this testing.

apiVersion: generators.external-secrets.io/v1alpha1
kind: Grafana
metadata:
  name: test
spec:
  url: "moolen.grafana.net"
  auth:
    token:
      name: "grafana-token"
      key: "token"
  serviceAccount:
    name: "foo"
    role: "Editor"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stateful-generator
spec:
  refreshInterval: 1m
  dataFrom:
    - sourceRef:
        generatorRef:
          apiVersion: generators.external-secrets.io/v1alpha1
          kind: Grafana
          name: test


Once applied and reconciling, you should be able to watch the service accounts come and go (with ever-increasing IDs).

@Skarlso
Contributor

Skarlso commented Nov 26, 2024

Thanks!

Contributor

@Skarlso Skarlso left a comment


I had a deeper look and have a few questions. :)

Comment on lines +102 to +111
func (m *Manager) MoveStateToGC(resource *apiextensions.JSON, stateKey string, gen v1alpha1.Generator, state v1alpha1.GeneratorProviderState) {
	m.internalState = append(m.internalState, QueueItem{
		Commit: func() error {
			return m.ImmediateMoveStateToGC(resource, stateKey, gen, state)
		},
	})
}

// ImmediateMoveStateToGC will move the generator state to GC immediately.
func (m *Manager) ImmediateMoveStateToGC(resource *apiextensions.JSON, stateKey string, gen v1alpha1.Generator, state v1alpha1.GeneratorProviderState) error {
Contributor


These two are a bit confusing, API- and usage-wise. The first one only enqueues a Commit function, while the second one performs the move right away; just looking at the API, I wouldn't have guessed that.

Move doesn't actually do anything until the Commit is invoked. I would call it a setup, or a "set commit", or something like that, to make it apparent that something else has to happen afterwards.

WDYT?
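A minimal sketch of the enqueue/commit pattern under discussion, using the more explicit naming suggested here (all names are hypothetical and heavily simplified from the real Manager):

```go
package main

import "fmt"

// Manager queues GC moves and performs them later; a hypothetical,
// heavily simplified stand-in for the Manager in this PR.
type Manager struct {
	queue []func() error
}

// EnqueueMoveStateToGC only records the intent to move state to GC.
// The explicit "Enqueue" prefix makes clear that nothing happens yet.
func (m *Manager) EnqueueMoveStateToGC(stateKey string) {
	m.queue = append(m.queue, func() error {
		fmt.Printf("moving %s to GC\n", stateKey)
		return nil
	})
}

// CommitAll performs every queued operation and clears the queue.
func (m *Manager) CommitAll() error {
	for _, op := range m.queue {
		if err := op(); err != nil {
			return err
		}
	}
	m.queue = nil
	return nil
}

func main() {
	m := &Manager{}
	m.EnqueueMoveStateToGC("token-a")
	m.EnqueueMoveStateToGC("token-b")
	_ = m.CommitAll() // prints "moving token-a to GC" then "moving token-b to GC"
}
```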

Member Author


I agree, that's bad design/naming. Let me rethink that.

@moolen
Member Author

moolen commented Nov 29, 2024

Meh, found a major issue after rebasing: the status update mechanics have changed and status updates get dropped. I'll need to dig into this 🐇 🕳️

@Skarlso
Contributor

Skarlso commented Nov 29, 2024

@moolen I think most of the time in that PR markAsFailed is not used. The reason behind that was that it would mess with the reconcile time, I believe. Maybe that was an error?!

@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 36dd28b to 62dea0e Compare December 3, 2024 00:07
@moolen moolen marked this pull request as draft December 3, 2024 00:17
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from 62dea0e to a739884 Compare December 3, 2024 00:34
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
@moolen moolen force-pushed the mj-pushsecret-generator-state branch from a739884 to a372c4e Compare December 3, 2024 00:41
@sonarqubecloud

sonarqubecloud bot commented Dec 3, 2024

@moolen
Member Author

moolen commented Dec 5, 2024

WRT #4086

I have removed the changes from #3608 which were causing needless GET requests on the status of ExternalSecrets.
I could not reproduce the race condition after my updates, but I also fixed the underlying issue, and we now abandon the reconcile if the existing secret has the spec.immutable field set to true.

Let me illustrate the important bits from ES's Reconcile() func:

	externalSecret := &esv1beta1.ExternalSecret{}
	err := r.Get(ctx, req.NamespacedName, externalSecret)

	// the PR removed this bit right here, which seems superfluous, but see below
	// for a specific case where this might cause issues.
	err = r.SubResource("status").Get(ctx, externalSecret, externalSecret)
	if err != nil {
		log.Error(err, "failed to get status subresource")
		return ctrl.Result{}, err
	}
	// ...things
	defer func() {
		// the status update goes straight to kube-apiserver and does not update the cache.
		// if we change the status AND requeue immediately (see below), then we requeue with the old status.
		updateErr := r.Status().Update(ctx, externalSecret)
		if updateErr != nil {
			log.Error(updateErr, "failed to update status")
		}
	}()
	// in some conditions we do requeue immediately. This should happen rarely, but it can.
	return ctrl.Result{Requeue: true}, nil

By (Kubernetes) design, a controller should be able to reconstruct the status. In our case we store stateful information (syncedResourceVersion and refreshTime) which cannot be reconstructed.

With this PR introducing another stateful field to the status subresource, this bug becomes more severe, triggering ~30-50% of the time due to many changes in status.

I think we shouldn't reintroduce the r.SubResource("status").Get( ... ) bit; it should be fine to just reconcile twice, unless there are bugs around lifecycle guarantees, e.g. where ESO should reconcile exactly once to create a secret (refreshInterval=0). Worst case, we would have to requeue with a delay to ensure the status update has propagated into the controller cache.


With that being said, I've decided to rewrite the state handling using a custom resource, as it seems less error-prone and follows the Kubernetes design principles more closely.

@moolen
Member Author

moolen commented Dec 5, 2024

I'm gonna close this PR and open a new one with the new implementation ⌛
