Skip OCI artifact fallback on retryable errors#9778
Conversation
|
/cherry-pick release-1.35 |
|
@bitoku: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a fallback path in image pulling: on certain copy.Image failures the code may attempt an OCI-artifact pull (with zero retries), returning combined errors on failure and constructing a reference from the artifact manifest digest on success; adds a test ensuring network-unreachable errors do not trigger the fallback. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client/Pod
participant Handler as pullImageImplementation
participant CopyOp as copy.Image
participant Checker as shouldTryArtifact
participant Artifact as OCI Artifact Store
Client->>Handler: Request image pull
Handler->>CopyOp: Attempt standard copy/pull
CopyOp-->>Handler: Error
Handler->>Checker: shouldTryArtifact(err)
alt shouldTryArtifact == false (network/unreachable)
Checker-->>Handler: false
Handler->>Client: Return original pull error
else shouldTryArtifact == true
Checker-->>Handler: true
Handler->>Artifact: Attempt artifact pull (MaxRetries=0)
Artifact-->>Handler: Success or Error
alt Artifact success
Handler->>Client: Return image ref built from artifact manifest digest
else Artifact failed
Handler->>Client: Return combined original + artifact errors
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/storage/image.go (2)
902-910:⚠️ Potential issue | 🟡 MinorGuard against a nil
artifactManifestDigestbefore dereferencing.If
artifactStore.Pullever returns(nil, nil),*artifactManifestDigeston line 906 will panic. Adding an explicit nil check makes the invariant explicit and prevents a runtime crash.🛡️ Suggested fix
if artifactErr != nil { return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: pull image err: %w; artifact err: %w", err, artifactErr) } + if artifactManifestDigest == nil { + return RegistryImageReference{}, fmt.Errorf("unable to pull OCI artifact: Pull succeeded but returned a nil digest") + } canonicalRef, err := reference.WithDigest(reference.TrimNamed(imageName.Raw()), *artifactManifestDigest)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 902 - 910, The code dereferences artifactManifestDigest when creating the canonicalRef (call to reference.WithDigest(reference.TrimNamed(imageName.Raw()), *artifactManifestDigest)) and can panic if artifactStore.Pull returned (nil, nil); add an explicit nil check for artifactManifestDigest after the Pull result and before calling reference.WithDigest, returning a clear error (e.g., "nil artifact manifest digest") if nil to avoid a runtime panic.
888-891:⚠️ Potential issue | 🟡 MinorOriginal copy error is silently dropped on store-creation failure.
When
ociartifact.NewStorefails, the returned error omits the originalcopy.Imageerror, losing the root cause from the diagnostic chain.🛡️ Suggested fix
- return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: create store err: %w", artifactErr) + return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: copy err: %w; create store err: %w", err, artifactErr)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 888 - 891, The error returned from ociartifact.NewStore (artifactErr) may wrap a lower-level copy.Image error that is currently not surfaced; update the error return in the artifactErr branch to include the original/inner error information when present (use errors.Unwrap or inspect artifactErr for an inner error) and wrap artifactErr with fmt.Errorf so the full diagnostic chain is preserved; update the block around ociartifact.NewStore, artifactStore and artifactErr to return fmt.Errorf("unable to pull image or OCI artifact: create store err: %v: %w", rootCause, artifactErr) where rootCause is the unwrapped inner error (fall back to artifactErr if Unwrap returns nil).
🧹 Nitpick comments (1)
internal/storage/image.go (1)
833-847:context.Canceledfalls through toreturn true, triggering an unnecessary artifact attempt.
context.DeadlineExceededsatisfiesnet.Error(it hasTimeout()/Temporary()) and is already excluded. However,context.Canceleddoes not implementnet.Error, so a cancelled-context failure fromcopy.Imagewould still attempt the artifact pull — with the already-cancelled context — and produce a confusing combined error.♻️ Suggested addition
func shouldTryArtifact(err error) bool { var netErr net.Error switch { case err == nil: return false + case errors.Is(err, context.Canceled): + // Context was already cancelled; an artifact pull would fail + // the same way immediately. + return false case errors.As(err, &netErr): // Network errors indicate the registry is unreachable, so an artifact // pull would fail the same way. Rely on Kubelet retries instead. return false default: return true } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 833 - 847, shouldTryArtifact currently treats context.Canceled as a regular error and returns true, causing pointless artifact attempts; update shouldTryArtifact to explicitly detect context.Canceled (using errors.Is(err, context.Canceled)) and return false before the switch/after the nil check so cancelled contexts do not trigger artifact pulls; keep the existing net.Error check and behavior for context.DeadlineExceeded intact and ensure you reference the shouldTryArtifact function when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/storage/image.go`:
- Around line 902-910: The code dereferences artifactManifestDigest when
creating the canonicalRef (call to
reference.WithDigest(reference.TrimNamed(imageName.Raw()),
*artifactManifestDigest)) and can panic if artifactStore.Pull returned (nil,
nil); add an explicit nil check for artifactManifestDigest after the Pull result
and before calling reference.WithDigest, returning a clear error (e.g., "nil
artifact manifest digest") if nil to avoid a runtime panic.
- Around line 888-891: The error returned from ociartifact.NewStore
(artifactErr) may wrap a lower-level copy.Image error that is currently not
surfaced; update the error return in the artifactErr branch to include the
original/inner error information when present (use errors.Unwrap or inspect
artifactErr for an inner error) and wrap artifactErr with fmt.Errorf so the full
diagnostic chain is preserved; update the block around ociartifact.NewStore,
artifactStore and artifactErr to return fmt.Errorf("unable to pull image or OCI
artifact: create store err: %v: %w", rootCause, artifactErr) where rootCause is
the unwrapped inner error (fall back to artifactErr if Unwrap returns nil).
---
Nitpick comments:
In `@internal/storage/image.go`:
- Around line 833-847: shouldTryArtifact currently treats context.Canceled as a
regular error and returns true, causing pointless artifact attempts; update
shouldTryArtifact to explicitly detect context.Canceled (using errors.Is(err,
context.Canceled)) and return false before the switch/after the nil check so
cancelled contexts do not trigger artifact pulls; keep the existing net.Error
check and behavior for context.DeadlineExceeded intact and ensure you reference
the shouldTryArtifact function when making the change.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/storage/image.go (1)
833-847: Consider also skipping artifact fallback on context cancellation/deadline errors.
context.Canceledandcontext.DeadlineExceededdon't implementnet.Error, so a pull that times out via context will still attempt the artifact fallback. If the parent context is already done, the artifact pull will fail immediately anyway, but it adds unnecessary code path execution and potentially a confusing combined error message. You may want to add a case forctx.Err() != nilorerrors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded).Suggested change
func shouldTryArtifact(err error) bool { var netErr net.Error switch { case err == nil: return false case errors.As(err, &netErr): // Network errors indicate the registry is unreachable, so an artifact // pull would fail the same way. Rely on Kubelet retries instead. return false + case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded): + // Context is already done; the artifact pull would fail immediately. + return false default: return true } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 833 - 847, The shouldTryArtifact function should also skip artifact fallback for context cancellation/deadline errors; update the switch to check for context.Canceled and context.DeadlineExceeded (e.g. using errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)) and return false in that case, so that shouldTryArtifact returns false when the pull failed due to a cancelled or timed-out context rather than only for net.Error cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/image.bats`:
- Around line 417-425: The grep assertion uses a bare negation (! grep -q ...)
which can be unreliable in Bats; change that assertion to use Bats' run wrapper
with negation (i.e., replace the bare "! grep -q "Falling back" "$CRIO_LOG""
with "run ! grep -q "Falling back" "$CRIO_LOG"") so the test name "image pull
should not fall back to OCI artifact on network error" reliably fails/passes
under CI and references the same CRIO_LOG variable.
---
Nitpick comments:
In `@internal/storage/image.go`:
- Around line 833-847: The shouldTryArtifact function should also skip artifact
fallback for context cancellation/deadline errors; update the switch to check
for context.Canceled and context.DeadlineExceeded (e.g. using errors.Is(err,
context.Canceled) || errors.Is(err, context.DeadlineExceeded)) and return false
in that case, so that shouldTryArtifact returns false when the pull failed due
to a cancelled or timed-out context rather than only for net.Error cases.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/storage/image.go (1)
833-887:⚠️ Potential issue | 🟡 MinorPass
ctxintoshouldTryArtifactto honor cancellation early.Right now a canceled pull can still trigger the artifact fallback/logging path. Thread
ctxthrough and short‑circuit onctx.Err()to avoid extra work and noise.💡 Suggested change
-func shouldTryArtifact(err error) bool { +func shouldTryArtifact(ctx context.Context, err error) bool { var netErr net.Error switch { case err == nil: return false + case ctx != nil && ctx.Err() != nil: + return false case errors.As(err, &netErr): // Network errors indicate the registry is unreachable, so an artifact // pull would fail the same way. Rely on Kubelet retries instead. return false default: return true } } @@ - if shouldTryArtifact(err) { + if shouldTryArtifact(ctx, err) {As per coding guidelines: Propagate context.Context through function calls in Go code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 833 - 887, The fallback logic may run after cancellation because shouldTryArtifact doesn't observe ctx; change shouldTryArtifact to accept a context.Context (func shouldTryArtifact(ctx context.Context, err error) bool) and make it return false immediately if ctx.Err() != nil, then update the call site in pullImageImplementation (and any other callers) to pass the current ctx (replace shouldTryArtifact(err) with shouldTryArtifact(ctx, err)); keep the existing net.Error check and other behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/storage/image.go`:
- Around line 833-887: The fallback logic may run after cancellation because
shouldTryArtifact doesn't observe ctx; change shouldTryArtifact to accept a
context.Context (func shouldTryArtifact(ctx context.Context, err error) bool)
and make it return false immediately if ctx.Err() != nil, then update the call
site in pullImageImplementation (and any other callers) to pass the current ctx
(replace shouldTryArtifact(err) with shouldTryArtifact(ctx, err)); keep the
existing net.Error check and other behavior unchanged.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
internal/storage/image.gointernal/storage/image_test.gotest/image.bats
💤 Files with no reviewable changes (1)
- internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/image.bats
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #9778 +/- ##
==========================================
+ Coverage 64.36% 64.44% +0.07%
==========================================
Files 210 210
Lines 29141 29162 +21
==========================================
+ Hits 18756 18792 +36
+ Misses 8692 8681 -11
+ Partials 1693 1689 -4 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/storage/image.go`:
- Around line 918-929: In shouldTryArtifact(err error) detect and short-circuit
context cancellation/errors first: check errors.Is(err, context.Canceled) and
errors.Is(err, context.DeadlineExceeded) and return false before testing for
net.Error; this ensures context cancellations and deadline exceeded cases do not
fallthrough to artifact fallback and still preserves the existing errors.As(err,
&netErr) logic and the default return true behavior.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
internal/storage/image.gointernal/storage/image_test.gotest/image.bats
💤 Files with no reviewable changes (1)
- internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/image.bats
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/storage/image.go (1)
916-933:shouldTryArtifactlooks correct.The
context.Canceled/context.DeadlineExceededshort-circuit at line 923 is properly in place, theerrors.Asnet-error check covers all wrapped network failures, and the "why" comment at lines 927–929 is clear. The function is defined after its call site, consistent with Go convention and the prior nit being acted on.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 916 - 933, No changes required: the shouldTryArtifact function correctly handles nil, context.Canceled, context.DeadlineExceeded, and wrapped network errors via errors.As, so leave shouldTryArtifact as-is (no code edits needed) and mark the review as approved for function shouldTryArtifact.
🧹 Nitpick comments (1)
internal/storage/image.go (1)
873-876: Consider including the original copy error in the store-creation failure path.Line 875 wraps only
artifactErr, while line 888 wraps both the originalcopy.Imageerror and the artifact error. If the store can't be created, the caller loses the reason the initial image pull failed, making it harder to triage.♻️ Proposed fix
- return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: create store err: %w", artifactErr) + return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: pull image err: %w; create store err: %w", err, artifactErr)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/image.go` around lines 873 - 876, The store-creation failure currently returns only artifactErr from ociartifact.NewStore, which loses the original copy.Image error; update the error return in the ociartifact.NewStore failure path to include the earlier image copy error (the variable produced by copy.Image / the image pull call) alongside artifactErr so callers see both causes (e.g., mention the copy error and the store creation error in the fmt.Errorf message), referencing ociartifact.NewStore, artifactErr and the earlier copy/Image error variable to locate and fix the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/storage/image.go`:
- Around line 916-933: No changes required: the shouldTryArtifact function
correctly handles nil, context.Canceled, context.DeadlineExceeded, and wrapped
network errors via errors.As, so leave shouldTryArtifact as-is (no code edits
needed) and mark the review as approved for function shouldTryArtifact.
---
Nitpick comments:
In `@internal/storage/image.go`:
- Around line 873-876: The store-creation failure currently returns only
artifactErr from ociartifact.NewStore, which loses the original copy.Image
error; update the error return in the ociartifact.NewStore failure path to
include the earlier image copy error (the variable produced by copy.Image / the
image pull call) alongside artifactErr so callers see both causes (e.g., mention
the copy error and the store creation error in the fmt.Errorf message),
referencing ociartifact.NewStore, artifactErr and the earlier copy/Image error
variable to locate and fix the code.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
internal/storage/image.gointernal/storage/image_test.gotest/image.bats
💤 Files with no reviewable changes (1)
- internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/image.bats
| switch { | ||
| case err == nil: | ||
| return false | ||
| case retry.IsErrorRetryable(err): |
There was a problem hiding this comment.
There was a problem hiding this comment.
This function is documented as unstable. The function explicitly says "That heuristic is NOT STABLE and it CAN CHANGE AT ANY TIME." Using it as a gate for the artifact fallback means the artifact fallback behavior is coupled to an unstable heuristic.
There was a problem hiding this comment.
right, I fixed it only skip fall back when there's network timeout, and other context errors.
|
/retest LGTM |
|
@saschagrunert Can you PTAL? |
| switch { | ||
| case err == nil: | ||
| return false | ||
| case retry.IsErrorRetryable(err): |
There was a problem hiding this comment.
This function is documented as unstable. The function explicitly says "That heuristic is NOT STABLE and it CAN CHANGE AT ANY TIME." Using it as a gate for the artifact fallback means the artifact fallback behavior is coupled to an unstable heuristic.
| if netError.Timeout() { | ||
| // Network errors are transient; retry as image pull instead of falling back. | ||
| return false | ||
| } |
There was a problem hiding this comment.
Non-timeout net.Error (e.g. ECONNREFUSED, ENETUNREACH, EHOSTUNREACH) falls through to return true, so the artifact fallback still runs against the same unreachable registry. Consider returning false for all net.Error matches, they are equally transient and the artifact pull would fail the same way.
| @test "image pull should not fall back to OCI artifact on network error" { | ||
| start_crio | ||
|
|
||
| # 192.0.2.1 is TEST-NET-1 (RFC 5737), guaranteed unreachable |
There was a problem hiding this comment.
This assumes 192.0.2.1 produces a net.Error with Timeout() == true (silently dropped packets). Depending on firewall/routing rules, it may return EHOSTUNREACH or ENETUNREACH instead (Timeout() == false), causing shouldTryArtifact to return true and the fallback to run.
There was a problem hiding this comment.
yeah my initial thought was that it can be imperfect because it just ends up with another pull.
And with #9782 CRI-O won't pull images as artifacts accidentally.
But I added the check, that if the error is net.Error, fail the whole pull.
| return references.RegistryImageReferenceFromRaw(canonicalRef), nil | ||
| } | ||
|
|
||
| if err != nil { |
There was a problem hiding this comment.
Nit: other error paths in this function use "unable to pull image..." this one says "unable to copy image".
| var netError net.Error | ||
| switch { | ||
| case err == nil: | ||
| return false | ||
| case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded): | ||
| // Caller-initiated cancellation or timeout; no point retrying as artifact. | ||
| return false |
There was a problem hiding this comment.
errors.Is(err, context.DeadlineExceeded) is technically redundant with errors.As(err, &netError) since Go's deadlineExceededError satisfies net.Error. But keeping it explicit alongside context.Canceled is fine for readability and intent.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bitoku, haircommander, saschagrunert The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
1 similar comment
|
/retest |
Use net.Error timeout detection to skip the OCI artifact fallback for transient network timeouts that should be retried as image pulls instead. For all other errors, optimistically fall back to an artifact pull since the image reference may actually be an artifact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
|
@saschagrunert |
|
/lgtm |
|
@bitoku: #9778 failed to apply on top of branch "release-1.35": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@bitoku: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Avoid falling back to an OCI artifact pull when the initial image pull fails due to a network error, since the artifact pull would fail the same way. This prevents blocking pod operations for an extra timeout cycle on unreachable registries. Also disable retries on the artifact pull path to further reduce latency on failures.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Summary by CodeRabbit
New Features
Bug Fixes
Tests