Skip to content
This repository was archived by the owner on Mar 3, 2025. It is now read-only.

🐛clustercatalog_controller: hacky, but more correct status updating in reconciler#424

Merged
joelanford merged 1 commit into
operator-framework:mainfrom
joelanford:fix-needs-unpack
Oct 4, 2024
Merged

🐛clustercatalog_controller: hacky, but more correct status updating in reconciler#424
joelanford merged 1 commit into
operator-framework:mainfrom
joelanford:fix-needs-unpack

Conversation

@joelanford

@joelanford joelanford commented Oct 3, 2024

Copy link
Copy Markdown
Member

For awhile in catalogd, we've had a very loose "needsUnpack" behavior where we would make the assumption that if we didn't need to unpack, then we also didn't need to update the status of the object.

This means that some objects may never have their status updated after an upgrade. That's problematic because upgrades may add new helpful fields or fix bugs in the way that status is reported.

This PR is a somewhat hacky fix for the situation. It solves the problem by keeping an in-memory map of the observedGeneration and resolved source, and then using that as the source of truth to and then use that expected status as part of the determination of whether an unpack is required.

A new criteria of "needsUnpack" is now: "does the expected status differ from the status that is on the object. If so, unpack again.

Fixes #422
Fixes #420

@joelanford joelanford requested a review from a team as a code owner October 3, 2024 20:20
@codecov

codecov Bot commented Oct 3, 2024

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.

Project coverage is 38.55%. Comparing base (8137da0) to head (63b0914).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rnal/controllers/core/clustercatalog_controller.go 87.20% 11 Missing ⚠️
cmd/manager/main.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #424      +/-   ##
==========================================
+ Coverage   35.28%   38.55%   +3.26%     
==========================================
  Files          14       14              
  Lines         802      843      +41     
==========================================
+ Hits          283      325      +42     
- Misses        472      474       +2     
+ Partials       47       44       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Unpacker source.Unpacker
Storage storage.Instance

finalizers crfinalizer.Finalizers

@joelanford joelanford Oct 3, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to refactor finalizers so that it had access to the reconciler struct. That way it would be possible to delete(r.storedCatalogs, catalog.Name) in the finalizer logic.

case r.needsPoll(storedCatalog.unpackResult.ResolvedSource.Image.LastSuccessfulPollAttempt.Time, catalog):
l.Info("unpack required: poll duration has elapsed")
needsUnpack = true
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally added lots of logging here to help us deduce why we might be unpacking more frequently than expected.

clearUnknownConditions(expectedStatus)
if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to change the updateStatusProgressing signature so that I could call it with a deep-copied status, and a different observed generation.

if tt.digestAlreadyExists {
assert.Contains(t, buf.String(), "image already unpacked")
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime())
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even asserting on the second could be brittle if we happen to run on the second to next second break. How about just asserting the duration is in a reasonable range?

Suggested change
assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))
assert.WithinDuration(t, rs.UnpackTime, unpackDirStat.ModTime(), time.Second)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should be exact. rs.UnpackTime is always truncated to the second in the same way. I'll add a comment in the code about why we need that though.

@bentito

bentito commented Oct 3, 2024

Copy link
Copy Markdown
Contributor

@joelanford when I point current operator-controller main at this with a replace and run make test-upgrade-e2e I am still seeing:

# github.com/operator-framework/operator-controller/internal/catalogmetadata/cache
internal/catalogmetadata/cache/cache.go:86:42: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:122:62: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:164:52: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers [github.com/operator-framework/operator-controller/internal/controllers.test]
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
make: *** [vet] Error 1

@joelanford

Copy link
Copy Markdown
Member Author

@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329.

No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR):

  1. We removed some status fields that operator-controller depends on, and
  2. We changed the spec.source.type enum from image to Image, with no backcompat support. So the upgrade tests are bound to fail because the new catalogd no longer understands old data (i.e. spec.source.type: "image")

@bentito

bentito commented Oct 4, 2024

Copy link
Copy Markdown
Contributor

@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329.

No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR):

1. We removed some status fields that operator-controller depends on, and

2. We changed the `spec.source.type` enum from `image` to `Image`, with no backcompat support. So the upgrade tests are bound to fail because the new catalogd no longer understands old data (i.e. `spec.source.type: "image"`)

ah, okay, understanding that now, was just worried it was supposed to be fixed from catalogd side alone

ankitathomas
ankitathomas previously approved these changes Oct 4, 2024

@ankitathomas ankitathomas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a small comment about the status comparison

clearUnknownConditions(expectedStatus)
if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)

@ankitathomas ankitathomas Oct 4, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to avoid waiting till the next poll before reconciling in case of unpack errors? Looks like the equivalence check for the Progressing condition will always fail if we had an error on the last unpack; do we backoff between reconciling errored unpacks?

@joelanford joelanford Oct 4, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is a good question. Here's a scenario:

  1. We are successfully serving content for generation 1
  2. We reconcile generation 2, which has a bad image ref, let's say the tag doesn't exist. So we:
    • Do not update storedCatalog map
    • DO update the status with Progressing=True, Retrying
  3. On the next reconcile (that is requeued basically immediately), we build an "expectedStatus" that says Installed=True and Progressing=False based on this code. We later compare this to the real status from the object and we are told that there are differences, which forces an unpack.

Step 3 is a little odd though. Maybe if we return an error from Reconcile we should delete(r.storedCatalogs, catalog.Name) as a better signal for "we need to attempt to unpack again"?

Comment on lines +215 to 224
updateStatusProgressing(&catalog.Status, catalog.GetGeneration(), storageErr)
return ctrl.Result{}, storageErr

@ankitathomas ankitathomas Oct 4, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering this isn't a terminal error, should we requeue here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By returning the storageErr, we get requeue based on exponential backoff. We only want the requeue based on poll interval if there are no errors during reconcile.

everettraven
everettraven previously approved these changes Oct 4, 2024
… reconciler

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford enabled auto-merge October 4, 2024 20:40
@joelanford joelanford added this pull request to the merge queue Oct 4, 2024
Merged via the queue into operator-framework:main with commit dee3337 Oct 4, 2024
@joelanford joelanford deleted the fix-needs-unpack branch October 22, 2024 13:03
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: when needsUnpack returns false no status update happens Avoid using resource status for reconcile logic

5 participants