Skip to content

Skip OCI artifact fallback on retryable errors#9778

Merged
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
bitoku:retry
Feb 27, 2026
Merged

Skip OCI artifact fallback on retryable errors#9778
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
bitoku:retry

Conversation

@bitoku

@bitoku bitoku commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Avoid falling back to an OCI artifact pull when the initial image pull fails due to a network error, since the artifact pull would fail the same way. This prevents blocking pod operations for an extra timeout cycle on unreachable registries. Also disable retries on the artifact pull path to further reduce latency on failures.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Skip the OCI artifact pull fallback when the initial image pull fails due to a retryable error

Summary by CodeRabbit

  • New Features

    • Added an OCI-artifact fallback for image pulls when primary pulls fail, with debug logging and final image reference construction from artifact digest; artifact pulls use zero retries to avoid blocking.
  • Bug Fixes

    • Returns clearer combined errors when primary and artifact pull attempts fail.
    • Detects network-related failures and avoids artifact fallback to prevent delays.
  • Tests

    • Added a test ensuring network errors do not trigger artifact fallback and cleaned up related test scaffolding.

@bitoku bitoku requested a review from mrunalp as a code owner February 24, 2026 18:07
@openshift-ci openshift-ci Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Feb 24, 2026
@openshift-ci openshift-ci Bot requested review from hasan4791 and klihub February 24, 2026 18:07
@bitoku

bitoku commented Feb 24, 2026

Copy link
Copy Markdown
Contributor Author

/cherry-pick release-1.35

@openshift-cherrypick-robot

Copy link
Copy Markdown

@bitoku: once the present PR merges, I will cherry-pick it on top of release-1.35 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a fallback path in image pulling: on certain copy.Image failures the code may attempt an OCI-artifact pull (with zero retries), returning combined errors on failure and constructing a reference from the artifact manifest digest on success; adds a test ensuring network-unreachable errors do not trigger the fallback.

Changes

Cohort / File(s) Summary
Image pull logic
internal/storage/image.go
Adds net and k8s.io/utils/ptr imports; introduces shouldTryArtifact(err error) bool; on eligible copy.Image errors logs a debug message and attempts an OCI artifact pull with MaxRetries=0; returns combined errors if artifact pull fails; builds final image ref from artifact manifest digest on success; adds guards to return descriptive errors when appropriate.
Integration test
test/image.bats
Adds test "image pull should not fall back to OCI artifact on network error" that attempts a pull from an unreachable IP and asserts pull fails and logs do not contain "Falling back".
Unit tests (cleanup)
internal/storage/image_test.go
Removes shared BeforeEach graphRoot setup and several mocked Store.GraphRoot() expectations used across PullImage tests.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/Pod
    participant Handler as pullImageImplementation
    participant CopyOp as copy.Image
    participant Checker as shouldTryArtifact
    participant Artifact as OCI Artifact Store

    Client->>Handler: Request image pull
    Handler->>CopyOp: Attempt standard copy/pull
    CopyOp-->>Handler: Error
    Handler->>Checker: shouldTryArtifact(err)
    alt shouldTryArtifact == false (network/unreachable)
        Checker-->>Handler: false
        Handler->>Client: Return original pull error
    else shouldTryArtifact == true
        Checker-->>Handler: true
        Handler->>Artifact: Attempt artifact pull (MaxRetries=0)
        Artifact-->>Handler: Success or Error
        alt Artifact success
            Handler->>Client: Return image ref built from artifact manifest digest
        else Artifact failed
            Handler->>Client: Return combined original + artifact errors
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

ok-to-test

Suggested reviewers

  • mrunalp
  • hasan4791

Poem

🐰 I sniff the pull, then pause to see,

If nets are dark I won't cling to a plea,
For other faults I hop the artifact track,
Combine the clues and build the final pack,
Tiny rabbit cheers — no extra fallback hack.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'Skip OCI artifact fallback on retryable errors' accurately reflects the main change in the PR - adding logic to avoid OCI artifact fallback when network errors occur, and disabling retries on the artifact path to prevent unnecessary delays.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/storage/image.go (2)

902-910: ⚠️ Potential issue | 🟡 Minor

Guard against a nil artifactManifestDigest before dereferencing.

If artifactStore.Pull ever returns (nil, nil), *artifactManifestDigest on line 906 will panic. Adding an explicit nil check makes the invariant explicit and prevents a runtime crash.

🛡️ Suggested fix
 		if artifactErr != nil {
 			return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: pull image err: %w; artifact err: %w", err, artifactErr)
 		}
+		if artifactManifestDigest == nil {
+			return RegistryImageReference{}, fmt.Errorf("unable to pull OCI artifact: Pull succeeded but returned a nil digest")
+		}
 
 		canonicalRef, err := reference.WithDigest(reference.TrimNamed(imageName.Raw()), *artifactManifestDigest)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 902 - 910, The code dereferences
artifactManifestDigest when creating the canonicalRef (call to
reference.WithDigest(reference.TrimNamed(imageName.Raw()),
*artifactManifestDigest)) and can panic if artifactStore.Pull returned (nil,
nil); add an explicit nil check for artifactManifestDigest after the Pull result
and before calling reference.WithDigest, returning a clear error (e.g., "nil
artifact manifest digest") if nil to avoid a runtime panic.

888-891: ⚠️ Potential issue | 🟡 Minor

Original copy error is silently dropped on store-creation failure.

When ociartifact.NewStore fails, the returned error omits the original copy.Image error, losing the root cause from the diagnostic chain.

🛡️ Suggested fix
-		return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: create store err: %w", artifactErr)
+		return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: copy err: %w; create store err: %w", err, artifactErr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 888 - 891, The error returned from
ociartifact.NewStore (artifactErr) may wrap a lower-level copy.Image error that
is currently not surfaced; update the error return in the artifactErr branch to
include the original/inner error information when present (use errors.Unwrap or
inspect artifactErr for an inner error) and wrap artifactErr with fmt.Errorf so
the full diagnostic chain is preserved; update the block around
ociartifact.NewStore, artifactStore and artifactErr to return fmt.Errorf("unable
to pull image or OCI artifact: create store err: %v: %w", rootCause,
artifactErr) where rootCause is the unwrapped inner error (fall back to
artifactErr if Unwrap returns nil).
🧹 Nitpick comments (1)
internal/storage/image.go (1)

833-847: context.Canceled falls through to return true, triggering an unnecessary artifact attempt.

context.DeadlineExceeded satisfies net.Error (it has Timeout()/Temporary()) and is already excluded. However, context.Canceled does not implement net.Error, so a cancelled-context failure from copy.Image would still attempt the artifact pull — with the already-cancelled context — and produce a confusing combined error.

♻️ Suggested addition
 func shouldTryArtifact(err error) bool {
 	var netErr net.Error
 	switch {
 	case err == nil:
 		return false
+	case errors.Is(err, context.Canceled):
+		// Context was already cancelled; an artifact pull would fail
+		// the same way immediately.
+		return false
 	case errors.As(err, &netErr):
 		// Network errors indicate the registry is unreachable, so an artifact
 		// pull would fail the same way. Rely on Kubelet retries instead.
 		return false
 	default:
 		return true
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 833 - 847, shouldTryArtifact
currently treats context.Canceled as a regular error and returns true, causing
pointless artifact attempts; update shouldTryArtifact to explicitly detect
context.Canceled (using errors.Is(err, context.Canceled)) and return false
before the switch/after the nil check so cancelled contexts do not trigger
artifact pulls; keep the existing net.Error check and behavior for
context.DeadlineExceeded intact and ensure you reference the shouldTryArtifact
function when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@internal/storage/image.go`:
- Around line 902-910: The code dereferences artifactManifestDigest when
creating the canonicalRef (call to
reference.WithDigest(reference.TrimNamed(imageName.Raw()),
*artifactManifestDigest)) and can panic if artifactStore.Pull returned (nil,
nil); add an explicit nil check for artifactManifestDigest after the Pull result
and before calling reference.WithDigest, returning a clear error (e.g., "nil
artifact manifest digest") if nil to avoid a runtime panic.
- Around line 888-891: The error returned from ociartifact.NewStore
(artifactErr) may wrap a lower-level copy.Image error that is currently not
surfaced; update the error return in the artifactErr branch to include the
original/inner error information when present (use errors.Unwrap or inspect
artifactErr for an inner error) and wrap artifactErr with fmt.Errorf so the full
diagnostic chain is preserved; update the block around ociartifact.NewStore,
artifactStore and artifactErr to return fmt.Errorf("unable to pull image or OCI
artifact: create store err: %v: %w", rootCause, artifactErr) where rootCause is
the unwrapped inner error (fall back to artifactErr if Unwrap returns nil).

---

Nitpick comments:
In `@internal/storage/image.go`:
- Around line 833-847: shouldTryArtifact currently treats context.Canceled as a
regular error and returns true, causing pointless artifact attempts; update
shouldTryArtifact to explicitly detect context.Canceled (using errors.Is(err,
context.Canceled)) and return false before the switch/after the nil check so
cancelled contexts do not trigger artifact pulls; keep the existing net.Error
check and behavior for context.DeadlineExceeded intact and ensure you reference
the shouldTryArtifact function when making the change.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7421a8e and 86384d0.

📒 Files selected for processing (2)
  • internal/storage/image.go
  • test/image.bats

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/storage/image.go (1)

833-847: Consider also skipping artifact fallback on context cancellation/deadline errors.

context.Canceled and context.DeadlineExceeded don't implement net.Error, so a pull that times out via context will still attempt the artifact fallback. If the parent context is already done, the artifact pull will fail immediately anyway, but it adds unnecessary code path execution and potentially a confusing combined error message. You may want to add a case for ctx.Err() != nil or errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded).

Suggested change
 func shouldTryArtifact(err error) bool {
 	var netErr net.Error
 	switch {
 	case err == nil:
 		return false
 	case errors.As(err, &netErr):
 		// Network errors indicate the registry is unreachable, so an artifact
 		// pull would fail the same way. Rely on Kubelet retries instead.
 		return false
+	case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
+		// Context is already done; the artifact pull would fail immediately.
+		return false
 	default:
 		return true
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 833 - 847, The shouldTryArtifact
function should also skip artifact fallback for context cancellation/deadline
errors; update the switch to check for context.Canceled and
context.DeadlineExceeded (e.g. using errors.Is(err, context.Canceled) ||
errors.Is(err, context.DeadlineExceeded)) and return false in that case, so that
shouldTryArtifact returns false when the pull failed due to a cancelled or
timed-out context rather than only for net.Error cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/image.bats`:
- Around line 417-425: The grep assertion uses a bare negation (! grep -q ...)
which can be unreliable in Bats; change that assertion to use Bats' run wrapper
with negation (i.e., replace the bare "! grep -q "Falling back" "$CRIO_LOG""
with "run ! grep -q "Falling back" "$CRIO_LOG"") so the test name "image pull
should not fall back to OCI artifact on network error" reliably fails/passes
under CI and references the same CRIO_LOG variable.

---

Nitpick comments:
In `@internal/storage/image.go`:
- Around line 833-847: The shouldTryArtifact function should also skip artifact
fallback for context cancellation/deadline errors; update the switch to check
for context.Canceled and context.DeadlineExceeded (e.g. using errors.Is(err,
context.Canceled) || errors.Is(err, context.DeadlineExceeded)) and return false
in that case, so that shouldTryArtifact returns false when the pull failed due
to a cancelled or timed-out context rather than only for net.Error cases.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86384d0 and e6ad63b.

📒 Files selected for processing (2)
  • internal/storage/image.go
  • test/image.bats

Comment thread test/image.bats

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/storage/image.go (1)

833-887: ⚠️ Potential issue | 🟡 Minor

Pass ctx into shouldTryArtifact to honor cancellation early.

Right now a canceled pull can still trigger the artifact fallback/logging path. Thread ctx through and short‑circuit on ctx.Err() to avoid extra work and noise.

💡 Suggested change
-func shouldTryArtifact(err error) bool {
+func shouldTryArtifact(ctx context.Context, err error) bool {
 	var netErr net.Error
 	switch {
 	case err == nil:
 		return false
+	case ctx != nil && ctx.Err() != nil:
+		return false
 	case errors.As(err, &netErr):
 		// Network errors indicate the registry is unreachable, so an artifact
 		// pull would fail the same way. Rely on Kubelet retries instead.
 		return false
 	default:
 		return true
 	}
 }
@@
-	if shouldTryArtifact(err) {
+	if shouldTryArtifact(ctx, err) {

As per coding guidelines: Propagate context.Context through function calls in Go code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 833 - 887, The fallback logic may run
after cancellation because shouldTryArtifact doesn't observe ctx; change
shouldTryArtifact to accept a context.Context (func shouldTryArtifact(ctx
context.Context, err error) bool) and make it return false immediately if
ctx.Err() != nil, then update the call site in pullImageImplementation (and any
other callers) to pass the current ctx (replace shouldTryArtifact(err) with
shouldTryArtifact(ctx, err)); keep the existing net.Error check and other
behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@internal/storage/image.go`:
- Around line 833-887: The fallback logic may run after cancellation because
shouldTryArtifact doesn't observe ctx; change shouldTryArtifact to accept a
context.Context (func shouldTryArtifact(ctx context.Context, err error) bool)
and make it return false immediately if ctx.Err() != nil, then update the call
site in pullImageImplementation (and any other callers) to pass the current ctx
(replace shouldTryArtifact(err) with shouldTryArtifact(ctx, err)); keep the
existing net.Error check and other behavior unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6ad63b and eebac9b.

📒 Files selected for processing (3)
  • internal/storage/image.go
  • internal/storage/image_test.go
  • test/image.bats
💤 Files with no reviewable changes (1)
  • internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/image.bats

Comment thread internal/storage/image.go Outdated
@codecov

codecov Bot commented Feb 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.44%. Comparing base (34f5b85) to head (1e19a72).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9778      +/-   ##
==========================================
+ Coverage   64.36%   64.44%   +0.07%     
==========================================
  Files         210      210              
  Lines       29141    29162      +21     
==========================================
+ Hits        18756    18792      +36     
+ Misses       8692     8681      -11     
+ Partials     1693     1689       -4     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/storage/image.go`:
- Around line 918-929: In shouldTryArtifact(err error) detect and short-circuit
context cancellation/errors first: check errors.Is(err, context.Canceled) and
errors.Is(err, context.DeadlineExceeded) and return false before testing for
net.Error; this ensures context cancellations and deadline exceeded cases do not
fallthrough to artifact fallback and still preserves the existing errors.As(err,
&netErr) logic and the default return true behavior.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eebac9b and 8b32812.

📒 Files selected for processing (3)
  • internal/storage/image.go
  • internal/storage/image_test.go
  • test/image.bats
💤 Files with no reviewable changes (1)
  • internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/image.bats

Comment thread internal/storage/image.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/storage/image.go (1)

916-933: shouldTryArtifact looks correct.

The context.Canceled / context.DeadlineExceeded short-circuit at line 923 is properly in place, the errors.As net-error check covers all wrapped network failures, and the "why" comment at lines 927–929 is clear. The function is defined after its call site, consistent with Go convention and the prior nit being acted on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 916 - 933, No changes required: the
shouldTryArtifact function correctly handles nil, context.Canceled,
context.DeadlineExceeded, and wrapped network errors via errors.As, so leave
shouldTryArtifact as-is (no code edits needed) and mark the review as approved
for function shouldTryArtifact.
🧹 Nitpick comments (1)
internal/storage/image.go (1)

873-876: Consider including the original copy error in the store-creation failure path.

Line 875 wraps only artifactErr, while line 888 wraps both the original copy.Image error and the artifact error. If the store can't be created, the caller loses the reason the initial image pull failed, making it harder to triage.

♻️ Proposed fix
-		return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: create store err: %w", artifactErr)
+		return RegistryImageReference{}, fmt.Errorf("unable to pull image or OCI artifact: pull image err: %w; create store err: %w", err, artifactErr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/storage/image.go` around lines 873 - 876, The store-creation failure
currently returns only artifactErr from ociartifact.NewStore, which loses the
original copy.Image error; update the error return in the ociartifact.NewStore
failure path to include the earlier image copy error (the variable produced by
copy.Image / the image pull call) alongside artifactErr so callers see both
causes (e.g., mention the copy error and the store creation error in the
fmt.Errorf message), referencing ociartifact.NewStore, artifactErr and the
earlier copy/Image error variable to locate and fix the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/storage/image.go`:
- Around line 916-933: No changes required: the shouldTryArtifact function
correctly handles nil, context.Canceled, context.DeadlineExceeded, and wrapped
network errors via errors.As, so leave shouldTryArtifact as-is (no code edits
needed) and mark the review as approved for function shouldTryArtifact.

---

Nitpick comments:
In `@internal/storage/image.go`:
- Around line 873-876: The store-creation failure currently returns only
artifactErr from ociartifact.NewStore, which loses the original copy.Image
error; update the error return in the ociartifact.NewStore failure path to
include the earlier image copy error (the variable produced by copy.Image / the
image pull call) alongside artifactErr so callers see both causes (e.g., mention
the copy error and the store creation error in the fmt.Errorf message),
referencing ociartifact.NewStore, artifactErr and the earlier copy/Image error
variable to locate and fix the code.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b32812 and 4a7c006.

📒 Files selected for processing (3)
  • internal/storage/image.go
  • internal/storage/image_test.go
  • test/image.bats
💤 Files with no reviewable changes (1)
  • internal/storage/image_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/image.bats

@bitoku bitoku changed the title Skip OCI artifact fallback on network errors Skip OCI artifact fallback on retryable errors Feb 25, 2026
Comment thread internal/storage/image.go Outdated
switch {
case err == nil:
return false
case retry.IsErrorRetryable(err):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is documented as unstable. The function explicitly says "That heuristic is NOT STABLE and it CAN CHANGE AT ANY TIME." Using it as a gate for the artifact fallback means the artifact fallback behavior is coupled to an unstable heuristic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I fixed it only skip fall back when there's network timeout, and other context errors.

@haircommander

Copy link
Copy Markdown
Member

/retest
/approve

LGTM

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026
@bitoku

bitoku commented Feb 26, 2026

Copy link
Copy Markdown
Contributor Author

@saschagrunert Can you PTAL?

Comment thread internal/storage/image.go
Comment thread internal/storage/image.go Outdated
switch {
case err == nil:
return false
case retry.IsErrorRetryable(err):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is documented as unstable. The function explicitly says "That heuristic is NOT STABLE and it CAN CHANGE AT ANY TIME." Using it as a gate for the artifact fallback means the artifact fallback behavior is coupled to an unstable heuristic.

Comment thread internal/storage/image.go Outdated
if netError.Timeout() {
// Network errors are transient; retry as image pull instead of falling back.
return false
}

@saschagrunert saschagrunert Feb 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-timeout net.Error (e.g. ECONNREFUSED, ENETUNREACH, EHOSTUNREACH) falls through to return true, so the artifact fallback still runs against the same unreachable registry. Consider returning false for all net.Error matches, they are equally transient and the artifact pull would fail the same way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I fixed it.

Comment thread test/image.bats
@test "image pull should not fall back to OCI artifact on network error" {
start_crio

# 192.0.2.1 is TEST-NET-1 (RFC 5737), guaranteed unreachable

@saschagrunert saschagrunert Feb 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes 192.0.2.1 produces a net.Error with Timeout() == true (silently dropped packets). Depending on firewall/routing rules, it may return EHOSTUNREACH or ENETUNREACH instead (Timeout() == false), causing shouldTryArtifact to return true and the fallback to run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah my initial thought was that it can be imperfect because it just ends up with another pull.
And with #9782 CRI-O won't pull images as artifacts accidentally.
But I added the check, that if the error is net.Error, fail the whole pull.

Comment thread internal/storage/image.go
return references.RegistryImageReferenceFromRaw(canonicalRef), nil
}

if err != nil {

@saschagrunert saschagrunert Feb 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: other error paths in this function use "unable to pull image..." this one says "unable to copy image".

Comment thread internal/storage/image.go
Comment on lines +920 to +926
var netError net.Error
switch {
case err == nil:
return false
case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
// Caller-initiated cancellation or timeout; no point retrying as artifact.
return false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Is(err, context.DeadlineExceeded) is technically redundant with errors.As(err, &netError) since Go's deadlineExceededError satisfies net.Error. But keeping it explicit alongside context.Canceled is fine for readability and intent.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci

openshift-ci Bot commented Feb 26, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, haircommander, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bitoku

bitoku commented Feb 26, 2026

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@bitoku

bitoku commented Feb 27, 2026

Copy link
Copy Markdown
Contributor Author

/retest

Use net.Error timeout detection to skip the OCI artifact fallback for
transient network timeouts that should be retried as image pulls
instead. For all other errors, optimistically fall back to an artifact
pull since the image reference may actually be an artifact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
@bitoku

bitoku commented Feb 27, 2026

Copy link
Copy Markdown
Contributor Author

@saschagrunert
Rebase removed lgtm, Can you tag it again?

@haircommander

Copy link
Copy Markdown
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 1e31373 into cri-o:main Feb 27, 2026
107 of 114 checks passed
@openshift-cherrypick-robot

Copy link
Copy Markdown

@bitoku: #9778 failed to apply on top of branch "release-1.35":

Applying: Skip OCI artifact fallback on transient network errors
Using index info to reconstruct a base tree...
M	internal/storage/image_test.go
M	test/image.bats
Falling back to patching base and 3-way merge...
Auto-merging test/image.bats
CONFLICT (content): Merge conflict in test/image.bats
Auto-merging internal/storage/image_test.go
CONFLICT (content): Merge conflict in internal/storage/image_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Skip OCI artifact fallback on transient network errors

Details

In response to this:

/cherry-pick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Feb 27, 2026

Copy link
Copy Markdown
Contributor

@bitoku: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ci-fedora-integration 1e19a72 link unknown /test ci-fedora-integration
ci/prow/e2e-aws-ovn 1e19a72 link unknown /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants