
design: cloud change logs #5822

Merged
negz merged 8 commits into crossplane:main from jbw976:design-change-logs
Jan 8, 2025

Conversation

@jbw976
Member

@jbw976 jbw976 commented Jul 9, 2024

Description of your changes

This PR contains a 1-pager design document for the cloud change logs feature described in #5477. The alpha implementation of this feature is intended to be included in the v1.17 release.

This design stems from some collaborative brainstorming sessions with @negz, so thank you for the assistance so far! 🙇‍♂️

While this design is being reviewed by the maintainer team and community, I will be prototyping on the side to a) further validate the feasibility of the proposed design and b) start making progress without delay.

Pro-tip to reviewers: this document makes heavy use of markdown section headers, as well as frequent summary checkpoints for the reader, so it's easily parseable and navigable using GitHub's "outline" view:
[Screenshot: GitHub's outline view showing the document's section headers]

I have:

  • Read and followed Crossplane's contribution process.
  • Run earthly +reviewable to ensure this PR is ready for review.
  • Added or updated unit tests.
  • Added or updated e2e tests.
  • Linked a PR or a docs tracking issue to document this change.
  • Added backport release-x.y labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.

@jbw976 jbw976 requested review from a team and negz as code owners July 9, 2024 05:11
@jbw976 jbw976 requested a review from turkenh July 9, 2024 05:11

The change log entry data will be serialized in a binary protobuf format for
transmission over the gRPC connection, while the data will be persisted to the
pod logs in a JSON serialized format.
Contributor

Why not marshal into a JSON blob and send a byte stream via grpc?

Member Author

Good question! I can see the general idea of keeping the format consistent all the way through instead of transforming back and forth. I'll need to learn a bit more about gRPC connections and their transfer modes though, i.e. what we may lose or have to do differently by using that transfer mode instead.
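As a rough sketch of the dual-format idea in the quoted design text, the same entry could travel as binary protobuf over gRPC and be appended to the pod logs as one JSON line. The struct and field names below are illustrative only, not the actual crossplane-runtime types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChangeLogEntry is a hypothetical shape for a change log entry; the real
// definition would be generated from a .proto file so it can be sent as
// binary protobuf over the gRPC connection to the sidecar.
type ChangeLogEntry struct {
	Provider     string          `json:"provider"`
	GVK          string          `json:"gvk"`
	Name         string          `json:"name"`
	ExternalName string          `json:"externalName,omitempty"`
	Operation    string          `json:"operation"`
	Before       json.RawMessage `json:"before,omitempty"`
}

// toJSONLine renders the entry as a single JSON line, the format the
// sidecar would persist to the pod logs.
func toJSONLine(e ChangeLogEntry) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	line, _ := toJSONLine(ChangeLogEntry{
		Provider:  "provider-aws-ec2",
		GVK:       "ec2.aws.upbound.io/v1beta2, Kind=Instance",
		Name:      "dev-instance-bt66d",
		Operation: "Create",
	})
	fmt.Println(line)
}
```

Sending the protobuf bytes on the wire and re-encoding to JSON only at the log sink keeps the gRPC contract strongly typed while still producing grep-friendly pod logs.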

@jbw976 jbw976 force-pushed the design-change-logs branch from dcb65ca to 714b9a2 on July 10, 2024 18:36
| Before operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.atProvider` |
| After operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.atProvider` |
| Result of operation | `success` or error object |
| (optional) Additional information that Providers can set as they choose | `{JSON object of arbitrary data}` |
Contributor

I remember there was a separate discussion with @luxas about whether we can use the change logs feature to record data about all reconciliations (for example, reconciliation timestamps), not just those reconciliations that attempt to correct drift. The after observation would then become optional (only recorded for reconciliations that update the external resource). crossplane-runtime may not need to be opinionated on this: we would extend crossplane-runtime (and, if needed, the package manager and the like) so that the sidecar (the gRPC server) and the client are properly set up, and the provider could use this infrastructure to implement related features. The proposed arbitrary data field opens up the possibility for similar cases, but crossplane-runtime had better remain un-opinionated about them if we would like to allow other use cases.

Re-observing the after-update external resource state would be the issue in this case if we take that control from the managed reconciler. I think we would not like the external client to make a second observation on its own. One idea would be to allow the external client to signal to the managed reconciler the need for an immediate reconciliation, so that the managed reconciler requeues an immediate, rate-limited reconciliation request without waiting for the next poll period.

How do you folks feel about allowing the providers to record more reconciliation data, not limited to updates?

Member

How do you folks feel about allowing the providers to record more reconciliation data, not limited to updates?

I don't feel like I understand well enough to answer the question. What are some examples of reconciliation data other than timestamps?

@jbw976 jbw976 force-pushed the design-change-logs branch from 714b9a2 to 7afb093 on July 12, 2024 23:55
Member Author

@jbw976 jbw976 left a comment

Thanks a lot for the detailed review @sttts! I have some more things to think through (e.g. how the sidecar pod could be injected and the format of the data transfer on the wire), but I've included responses to all of your comments! 🙇



@jbw976 jbw976 force-pushed the design-change-logs branch 2 times, most recently from bc96476 to a8ac33e on July 22, 2024 23:47
@jbw976
Member Author

jbw976 commented Jul 22, 2024

Thanks folks for the reviews and discussions so far, really appreciate it!

I spent some cycles last week prototyping an end-to-end effort to better understand the viability of the proposed design and how all the components fit together. I've responded to almost all comments now, and I've pushed a few new commits with additional and clarifying content:

  • `DeploymentRuntimeConfig` works great to inject the sidecar container, so we can rely on this instead of making changes to the package manager. An example was added of what this looks like.
  • Updated the scope and description of code changes we will make across the various Crossplane ecosystem components in our implementation rollout.
  • Added further details about why Kubernetes audit logs are not an ideal fit for our needs.
  • Added an alternative considered section for cloud provider audit logs too.

@jbw976 jbw976 force-pushed the design-change-logs branch from a8ac33e to 69cb87d on July 24, 2024 16:27
The code paths through the managed reconciler's `Reconcile()` method are already
a bit complex. One design choice that has helped minimize the complexity and
assist in readability is that the reconciler returns as early as it can after
performing an operation. Since we will now be adding an additional `Observe()`
Member

An additional Observe means one more call to the cloud provider API and will increase the probability of API rate limiting. At least for Creates, we always queue in the managed reconcilers and have a consecutive Observe, which we could leverage somehow.

Contributor

I think the additional Observe() call is a real concern; we already get complaints that Crossplane is making too many API calls. It would be logical to requeue after a CUD operation and use that Observe() call as the "after" state for the previous operation. Maybe use an "operation in-progress" flag to know that there has been a CUD request that needs verification.

Member Author

@jbw976 jbw976 Aug 1, 2024

I definitely hear the concerns about the extra Observe() call. The usefulness of it is even being called into question with #5822 (comment) 😊

I'm not against cutting the "after" state completely, at least from this initial alpha implementation. I think we still tell an interesting story about WHY we made a change without having that data.

Not implementing an "after" state may be preferable to having to manage state across multiple reconciles in order to link the Observe() results from a subsequent reconcile with the Create() or Update() operation from a previous reconcile. As things are currently implemented (crossplane/crossplane-runtime@master...jbw976:crossplane-runtime:change-logs), everything is contained within a single reconcile, which is definitely simpler 😇

So in this alpha implementation, we could:

  1. perform the extra Observe() call and see the feedback from early adopters
  2. cut the "after" state functionality and do nothing
  3. code something more complex that connects Observe() state to Create()/Update()/Delete() observations across reconciliations

Strong opinions from you all? #3 is my least preferred I think 😂

Contributor

I think alpha is a great time to see what the impact of the second Observe() call would be. If it's bad we can always cut it when it goes beta.

Contributor

You could always put it under its own nested flag so the feature could run with or without the second Observe() call.
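The nested-flag suggestion could look roughly like this, a minimal sketch in which the change logs feature has its own flag and the second Observe() is gated behind a further option (all names are hypothetical, not actual crossplane-runtime options):

```go
package main

import "fmt"

// ChangeLogOptions sketches a nested feature gate: the change logs feature
// is enabled as a whole, and the extra "after" Observe() call is opted into
// separately. Field names here are illustrative only.
type ChangeLogOptions struct {
	Enabled           bool
	CaptureAfterState bool // gates the second Observe() call
}

// shouldObserveAfter reports whether the reconciler should make the extra
// Observe() call to capture "after" state for a change log entry.
func shouldObserveAfter(o ChangeLogOptions) bool {
	return o.Enabled && o.CaptureAfterState
}

func main() {
	fmt.Println(shouldObserveAfter(ChangeLogOptions{Enabled: true}))
	fmt.Println(shouldObserveAfter(ChangeLogOptions{Enabled: true, CaptureAfterState: true}))
}
```

This way early adopters who are sensitive to API rate limits could keep change logs on while skipping the extra call.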

Member Author

Chatted with @negz a bit more about "after" state and a 2nd Observe() call:

  • From the first Observe() call, we should have a very up to date status.atProvider for any modern managed resource
  • The difference between spec.forProvider and status.atProvider should fairly conclusively indicate why we are performing an operation
  • The content of an "after" state collected from a 2nd Observe() is unlikely to differ from the "before" state, and therefore has limited value, as called out in #5822 (comment)

Therefore, it's reasonable to remove the "after" state from this design, and thus any reason to perform a 2nd Observe(). Without it, we still have a useful record of why we are making changes, whether because the spec's desired state has changed, we detected drift in the external resource, etc.

Perhaps Providers could be extended further in the future to provide more details in their ExternalObservation about why they concluded `ResourceUpToDate: false`, and we could include that information in the change logs. For example, there is an `ExternalObservation.Diff` field, but I don't believe many Providers are currently setting it.
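The reasoning above can be sketched concretely: comparing desired `spec.forProvider` fields against observed `status.atProvider` fields already explains why an operation is being performed, with no second Observe() call. The function and field names below are illustrative, not actual managed reconciler code:

```go
package main

import (
	"fmt"
	"reflect"
)

// driftedFields returns the keys whose desired value differs from (or is
// missing in) the observed state. In change log terms, this is the "why"
// behind an Update, derived entirely from the first Observe().
func driftedFields(desired, observed map[string]any) []string {
	var drift []string
	for k, want := range desired {
		if got, ok := observed[k]; !ok || !reflect.DeepEqual(want, got) {
			drift = append(drift, k)
		}
	}
	return drift
}

func main() {
	// Hypothetical spec.forProvider vs status.atProvider snapshots.
	desired := map[string]any{"instanceType": "t3.medium", "region": "us-west-2"}
	observed := map[string]any{"instanceType": "t3.small", "region": "us-west-2"}
	fmt.Println(driftedFields(desired, observed)) // → [instanceType]
}
```

A Provider that populated `ExternalObservation.Diff` with something like this output would let the change log record the cause of a change, not just its before state.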

Contributor

@bobh66 bobh66 left a comment

Overall I think it's looking good - my main concern is the extra Observe and the impact it has on the provider API usage/throttling.


jbw976 added 5 commits August 1, 2024 08:34
Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
* `DeploymentRuntimeConfig` works great to inject sidecar container,
  we can rely on this instead of making changes to package manager.
* Updated the scope and description of code changes we will make
  across the various Crossplane ecosystem components.
* Added further details about why Kubernetes audit logs are not an
  ideal fit for our needs.

Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
@jbw976 jbw976 force-pushed the design-change-logs branch from 69cb87d to 11c2b14 on August 1, 2024 15:53
jbw976 added 2 commits August 1, 2024 10:09
Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
Member

@turkenh turkenh left a comment

Looks good to me, great write-up 🙌

| Provider name | `xpkg.upbound.io/upbound/provider-aws-ec2:v1.8.0` |
| Type (GVK) | `apiVersion: ec2.aws.upbound.io/v1beta2, kind: Instance` |
| Name | `dev-instance-bt66d` |
| External Name | `vpc-4fa03d9fb92dfec50` |

Any chance we can include resource labels?

Assume we aggregate these changelog events into a search index. We'd then want to correlate them with higher-level events (ArgoCD or FluxCD resource events) and finally point to a commit (or set of commits).

Having the resource labels as KVs in this schema would allow us to build the relationship with those higher-level events more directly, providing a solid end-to-end view into how changes at the higher-end of the abstraction flow to changes at the lowest end.

Signed-off-by: Jared Watts <jbw976@gmail.com>
@vijaymateti

@negz It would be great if this design doc could be merged, as change logs already made it into the latest release and not much documentation is available for it. Thank you!

@negz negz merged commit 62dedb8 into crossplane:main Jan 8, 2025