
design: cloud change logs #5822

Merged
negz merged 8 commits into crossplane:main from jbw976:design-change-logs
Jan 8, 2025

Conversation

@jbw976
Member

@jbw976 jbw976 commented Jul 9, 2024

Description of your changes

This PR contains a 1-pager design document for the cloud change logs feature described in #5477. The alpha implementation of this feature is intended to be included in the v1.17 release.

This design stems from some collaborative brainstorming sessions with @negz, so thank you for the assistance so far! 🙇‍♂️

While this design is being reviewed by the maintainer team and community, I will be prototyping on the side to a) further validate the feasibility of the proposed design and b) start making progress without delay.

Pro-tip to reviewers: this document makes heavy use of markdown section headers, as well as frequent summary checkpoints for the reader, so it's easily parseable and navigable using GitHub's "outline" view:
[Screenshot: GitHub's outline view showing the document's section headers]

I have:

  • Read and followed Crossplane's contribution process.
  • Run earthly +reviewable to ensure this PR is ready for review.
  • Added or updated unit tests.
  • Added or updated e2e tests.
  • Linked a PR or a docs tracking issue to document this change.
  • Added backport release-x.y labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.

@jbw976 jbw976 requested review from a team and negz as code owners July 9, 2024 05:11
@jbw976 jbw976 requested a review from turkenh July 9, 2024 05:11

The change log entry data will be serialized in a binary protobuf format for
transmission over the gRPC connection, while the data will be persisted to the
pod logs in a JSON serialized format.
Contributor

Why not marshal into a JSON blob and send a byte stream via grpc?

Member Author

Good question! I can see the general idea of keeping the format consistent all the way through instead of transforming back and forth. I'll need to learn a bit more about gRPC connections and their transfer modes though, i.e. what we may lose or have to do differently by using that transfer mode instead.
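As a rough sketch of the dual-format idea in the quoted design text, the same entry could travel as binary protobuf over gRPC and be appended to the pod logs as one JSON line. The struct and field names below are illustrative only, not the actual crossplane-runtime types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChangeLogEntry is a hypothetical shape for a change log entry; the real
// definition would be generated from a .proto file so it can be sent as
// binary protobuf over the gRPC connection to the sidecar.
type ChangeLogEntry struct {
	Provider     string          `json:"provider"`
	GVK          string          `json:"gvk"`
	Name         string          `json:"name"`
	ExternalName string          `json:"externalName,omitempty"`
	Operation    string          `json:"operation"`
	Before       json.RawMessage `json:"before,omitempty"`
}

// toJSONLine renders the entry as a single JSON line, the format the
// sidecar would persist to the pod logs.
func toJSONLine(e ChangeLogEntry) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	line, _ := toJSONLine(ChangeLogEntry{
		Provider:  "provider-aws-ec2",
		GVK:       "ec2.aws.upbound.io/v1beta2, Kind=Instance",
		Name:      "dev-instance-bt66d",
		Operation: "Create",
	})
	fmt.Println(line)
}
```

Sending the protobuf bytes on the wire and re-encoding to JSON only at the log sink keeps the gRPC contract strongly typed while still producing grep-friendly pod logs.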

@jbw976 jbw976 force-pushed the design-change-logs branch from dcb65ca to 714b9a2 on July 10, 2024 18:36
| Before operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.atProvider` |
| After operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.atProvider` |
| Result of operation | `success` or error object |
| (optional) Additional information that Providers can set as they choose | `{JSON object of arbitrary data}` |
Contributor

I remember there was a separate discussion with @luxas about whether we can use the change logs feature to record data about all reconciliations (for example, reconciliation timestamps), not just those reconciliations that attempt to correct drift. The after observation would then become optional (only recorded for reconciliations that update the external resource). crossplane-runtime may not need to be opinionated on this: we would extend crossplane-runtime (and, if needed, the package manager and the like) so that the sidecar (the gRPC server) and the client are properly set up, and the provider could use this infrastructure to implement related features. The proposed arbitrary data field opens up the possibility for similar cases, but crossplane-runtime had better remain un-opinionated about them if we would like to allow other use cases.

Re-observing the after-update external resource state would be the issue in this case if we take that control from the managed reconciler. I think we would not like the external client to make a second observation on its own. One idea would be to allow the external client to signal to the managed reconciler the need for an immediate reconciliation, so that the managed reconciler requeues an immediate, rate-limited reconciliation request without waiting for the next poll period.

How do you folks feel about allowing the providers to record more reconciliation data, not limited to updates?

Member

How do you folks feel about allowing the providers to record more reconciliation data, not limited to updates?

I don't feel like I understand well enough to answer the question. What are some examples of reconciliation data other than timestamps?

@jbw976 jbw976 force-pushed the design-change-logs branch from 714b9a2 to 7afb093 on July 12, 2024 23:55
Member Author

@jbw976 jbw976 left a comment

Thanks a lot for the detailed review @sttts! I have some more things to think through (e.g. how the sidecar pod could be injected and the format of the data transfer on the wire), but I've included responses to all of your comments! 🙇



@jbw976 jbw976 force-pushed the design-change-logs branch 2 times, most recently from bc96476 to a8ac33e on July 22, 2024 23:47
@jbw976
Member Author

jbw976 commented Jul 22, 2024

Thanks folks for the reviews and discussions so far, really appreciate it!

I spent some cycles last week prototyping an end-to-end effort to better understand the viability of the proposed design and how all the components fit together. I've responded to almost all comments now, and I've pushed a few new commits with additional and clarifying content:

  • `DeploymentRuntimeConfig` works great to inject the sidecar container, so we can rely on this instead of making changes to the package manager. An example was added of what this looks like.
  • Updated the scope and description of code changes we will make across the various Crossplane ecosystem components in our implementation rollout.
  • Added further details about why Kubernetes audit logs are not an ideal fit for our needs.
  • Added an alternative considered section for cloud provider audit logs too.

@jbw976 jbw976 force-pushed the design-change-logs branch from a8ac33e to 69cb87d on July 24, 2024 16:27
The code paths through the managed reconciler's `Reconcile()` method are already
a bit complex. One design choice that has helped minimize the complexity and
assist in readability is that the reconciler returns as early as it can after
performing an operation. Since we will now be adding an additional `Observe()`
Member

An additional Observe means one more call to the cloud provider API and will increase the probability of API rate limiting. At least for Creates, we always queue in the managed reconcilers and have a consecutive Observe, which we could leverage somehow.

Contributor

I think the additional Observe() call is a real concern; we already get complaints that Crossplane is making too many API calls. It would be logical to requeue after a CUD operation and use that Observe() call as the "after" state for the previous operation. Maybe use an "operation in-progress" flag to know that there has been a CUD request that needs verification.

Member Author

@jbw976 jbw976 Aug 1, 2024

I definitely hear the concerns about the extra Observe() call. The usefulness of it is even being called into question with #5822 (comment) 😊

I'm not against cutting the "after" state completely, at least from this initial alpha implementation. I think we still tell an interesting story about WHY we made a change without having that data.

Not implementing an "after" state may be preferable to having to manage state across multiple reconciles in order to link the Observe() results from a subsequent reconcile with the Create() or Update() operation from a previous reconcile. As things are currently implemented (crossplane/crossplane-runtime@master...jbw976:crossplane-runtime:change-logs), everything is contained within a single reconcile, which is definitely simpler 😇

So in this alpha implementation, we could:

  1. perform the extra Observe() call and see the feedback from early adopters
  2. cut the "after" state functionality and do nothing
  3. code something more complex that connects Observe() state to Create()/Update()/Delete() observations across reconciliations

Strong opinions from you all? #3 is my least preferred I think 😂

Contributor

I think alpha is a great time to see what the impact of the second Observe() call would be. If it's bad we can always cut it when it goes beta.

Contributor

You could always put it under its own nested flag so the feature could run with or without the second Observe() call.
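The nested-flag suggestion could look roughly like this, a minimal sketch in which the change logs feature has its own flag and the second Observe() is gated behind a further option (all names are hypothetical, not actual crossplane-runtime options):

```go
package main

import "fmt"

// ChangeLogOptions sketches a nested feature gate: the change logs feature
// is enabled as a whole, and the extra "after" Observe() call is opted into
// separately. Field names here are illustrative only.
type ChangeLogOptions struct {
	Enabled           bool
	CaptureAfterState bool // gates the second Observe() call
}

// shouldObserveAfter reports whether the reconciler should make the extra
// Observe() call to capture "after" state for a change log entry.
func shouldObserveAfter(o ChangeLogOptions) bool {
	return o.Enabled && o.CaptureAfterState
}

func main() {
	fmt.Println(shouldObserveAfter(ChangeLogOptions{Enabled: true}))
	fmt.Println(shouldObserveAfter(ChangeLogOptions{Enabled: true, CaptureAfterState: true}))
}
```

This way early adopters who are sensitive to API rate limits could keep change logs on while skipping the extra call.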

Member Author

Chatted with @negz a bit more about "after" state and a 2nd Observe() call:

  • From the first Observe() call, we should have a very up to date status.atProvider for any modern managed resource
  • The difference between spec.forProvider and status.atProvider should fairly conclusively indicate why we are performing an operation
  • The content of an "after" state collected from a 2nd Observe() is unlikely to differ from the "before" state, and therefore has limited value, as called out in #5822 (comment)

Therefore, it's reasonable to remove the "after" state from this design, and thus any reason to perform a 2nd Observe(). Without it, we still have a useful record of why we are making changes, whether because the spec's desired state has changed, we detected drift in the external resource, etc.

Perhaps Providers could be extended further in the future to provide more details in their ExternalObservation about why they concluded `ResourceUpToDate: false`, and we could include that information in the change logs. For example, there is an `ExternalObservation.Diff` field, but I don't believe many Providers are currently setting it.
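The reasoning above can be sketched concretely: comparing desired `spec.forProvider` fields against observed `status.atProvider` fields already explains why an operation is being performed, with no second Observe() call. The function and field names below are illustrative, not actual managed reconciler code:

```go
package main

import (
	"fmt"
	"reflect"
)

// driftedFields returns the keys whose desired value differs from (or is
// missing in) the observed state. In change log terms, this is the "why"
// behind an Update, derived entirely from the first Observe().
func driftedFields(desired, observed map[string]any) []string {
	var drift []string
	for k, want := range desired {
		if got, ok := observed[k]; !ok || !reflect.DeepEqual(want, got) {
			drift = append(drift, k)
		}
	}
	return drift
}

func main() {
	// Hypothetical spec.forProvider vs status.atProvider snapshots.
	desired := map[string]any{"instanceType": "t3.medium", "region": "us-west-2"}
	observed := map[string]any{"instanceType": "t3.small", "region": "us-west-2"}
	fmt.Println(driftedFields(desired, observed)) // → [instanceType]
}
```

A Provider that populated `ExternalObservation.Diff` with something like this output would let the change log record the cause of a change, not just its before state.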

Contributor

@bobh66 bobh66 left a comment

Overall I think it's looking good - my main concern is the extra Observe and the impact it has on the provider API usage/throttling.


jbw976 added 5 commits August 1, 2024 08:34
Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
* `DeploymentRuntimeConfig` works great to inject sidecar container,
  we can rely on this instead of making changes to package manager.
* Updated the scope and description of code changes we will make
  across the various Crossplane ecosystem components.
* Added further details about why Kubernetes audit logs are not an
  ideal fit for our needs.

Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
@jbw976 jbw976 force-pushed the design-change-logs branch from 69cb87d to 11c2b14 on August 1, 2024 15:53
jbw976 added 2 commits August 1, 2024 10:09
Signed-off-by: Jared Watts <jbw976@gmail.com>
Signed-off-by: Jared Watts <jbw976@gmail.com>
Member

@turkenh turkenh left a comment

Looks good to me, great write-up 🙌

| Provider name | `xpkg.upbound.io/upbound/provider-aws-ec2:v1.8.0` |
| Type (GVK) | `apiVersion: ec2.aws.upbound.io/v1beta2, kind: Instance` |
| Name | `dev-instance-bt66d` |
| External Name | `vpc-4fa03d9fb92dfec50` |

Any chance we can include resource labels?

Assume we aggregate these changelog events into a search index. We'd then want to correlate them with higher-level events (ArgoCD or FluxCD resource events) and finally point to a commit (or set of commits).

Having the resource labels as KVs in this schema would allow us to build the relationship with those higher-level events more directly, providing a solid end-to-end view into how changes at the higher-end of the abstraction flow to changes at the lowest end.

Signed-off-by: Jared Watts <jbw976@gmail.com>
@vijaymateti

@negz It would be great if this design doc could be merged, as change logs already made it into the latest release and not much documentation is available for it. Thank you!

@negz negz merged commit 62dedb8 into crossplane:main Jan 8, 2025