🐛clustercatalog_controller: hacky, but more correct status updating in reconciler#424
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #424 +/- ##
==========================================
+ Coverage 35.28% 38.55% +3.26%
==========================================
Files 14 14
Lines 802 843 +41
==========================================
+ Hits 283 325 +42
- Misses 472 474 +2
+ Partials 47 44 -3 ☔ View full report in Codecov by Sentry. |
| Unpacker source.Unpacker | ||
| Storage storage.Instance | ||
|
|
||
| finalizers crfinalizer.Finalizers |
There was a problem hiding this comment.
I needed to refactor finalizers so that it had access to the reconciler struct. That way it would be possible to delete(r.storedCatalogs, catalog.Name) in the finalizer logic.
| case r.needsPoll(storedCatalog.unpackResult.ResolvedSource.Image.LastSuccessfulPollAttempt.Time, catalog): | ||
| l.Info("unpack required: poll duration has elapsed") | ||
| needsUnpack = true | ||
| } |
There was a problem hiding this comment.
I intentionally added lots of logging here to help us deduce why we might be unpacking more frequently than expected.
| clearUnknownConditions(expectedStatus) | ||
| if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) { | ||
| updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration) | ||
| updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil) |
There was a problem hiding this comment.
I needed to change the updateStatusProgressing signature so that I could call it with a deep-copied status, and a different observed generation.
| if tt.digestAlreadyExists { | ||
| assert.Contains(t, buf.String(), "image already unpacked") | ||
| assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime()) | ||
| assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second)) |
There was a problem hiding this comment.
I think even asserting on the second could be brittle if we happen to run on the second to next second break. How about just asserting the duration is in a reasonable range?
| assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second)) | |
| assert.WithinDuration(t, rs.UnpackTime, unpackDirStat.ModTime(), time.Second) |
There was a problem hiding this comment.
This one should be exact. rs.UnpackTime is always truncated to the second in the same way. I'll add a comment in the code about why we need that though.
|
@joelanford when I point current operator-controller main at this with a # github.com/operator-framework/operator-controller/internal/catalogmetadata/cache
internal/catalogmetadata/cache/cache.go:86:42: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:122:62: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:164:52: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers [github.com/operator-framework/operator-controller/internal/controllers.test]
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
make: *** [vet] Error 1 |
797ff05 to
2a05670
Compare
|
@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329. No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR):
|
ah, okay, understanding that now, was just worried it was supposed to be fixed from catalogd side alone |
ankitathomas
left a comment
There was a problem hiding this comment.
Looks good! Left a small comment about the status comparison
| clearUnknownConditions(expectedStatus) | ||
| if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) { | ||
| updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration) | ||
| updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil) |
There was a problem hiding this comment.
Is this to avoid waiting till the next poll before reconciling in case of unpack errors? Looks like the equivalence check for the Progressing condition will always fail if we had an error on the last unpack; do we backoff between reconciling errored unpacks?
There was a problem hiding this comment.
Oh this is a good question. Here's a scenario:
- We are successfully serving content for generation 1
- We reconcile generation 2, which has a bad image ref, let's say the tag doesn't exist. So we:
- Do not update
storedCatalogmap - DO update the status with
Progressing=True, Retrying
- Do not update
- On the next reconcile (that is requeued basically immediately), we build an "expectedStatus" that says
Installed=TrueandProgressing=Falsebased on this code. We later compare this to the real status from the object and we are told that there are differences, which forces an unpack.
Step 3 is a little odd though. Maybe if we return an error from Reconcile we should delete(r.storedCatalogs, catalog.Name) as a better signal for "we need to attempt to unpack again"?
| updateStatusProgressing(&catalog.Status, catalog.GetGeneration(), storageErr) | ||
| return ctrl.Result{}, storageErr |
There was a problem hiding this comment.
considering this isn't a terminal error, should we requeue here?
There was a problem hiding this comment.
By returning the storageErr, we get requeue based on exponential backoff. We only want the requeue based on poll interval if there are no errors during reconcile.
… reconciler Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
63b0914
2a05670 to
63b0914
Compare
For awhile in catalogd, we've had a very loose "needsUnpack" behavior where we would make the assumption that if we didn't need to unpack, then we also didn't need to update the status of the object.
This means that some objects may never have their status updated after an upgrade. That's problematic because upgrades may add new helpful fields or fix bugs in the way that status is reported.
This PR is a somewhat hacky fix for the situation. It solves the problem by keeping an in-memory map of the observedGeneration and resolved source, and then using that as the source of truth to and then use that expected status as part of the determination of whether an unpack is required.
A new criteria of "needsUnpack" is now: "does the expected status differ from the status that is on the object. If so, unpack again.
Fixes #422
Fixes #420