Skip to content

OCPBUGS-81547: oci: wait for exit file before defaulting to exit code 255#9846

Merged
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
bitoku:fix/9840
Apr 8, 2026
Merged

OCPBUGS-81547: oci: wait for exit file before defaulting to exit code 255#9846
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
bitoku:fix/9840

Conversation

@bitoku

@bitoku bitoku commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.

Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:

  • Path A: when runtimeCmd("state") fails and the exit file is missing
  • Path B: when runtimeCmd("state") reports the container as stopped

This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.

Which issue(s) this PR fixes:

Fixes #9840

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed the race condition where cri-o reports exitCode 255 when the container exits fast.

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability by waiting for container exit markers when runtime state queries fail, and clarifying related logging.
    • Ensures exit time and exit code fallback to 255 when exit status cannot be determined after retries.
  • Tests

    • Added tests validating successful detection from delayed exit markers and fallback-to-255 when none appear.

@bitoku bitoku requested a review from mrunalp as a code owner March 27, 2026 17:48
@openshift-ci openshift-ci Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Mar 27, 2026
@coderabbitai

coderabbitai Bot commented Mar 27, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Wait for container exit file using exponential backoff when OCI runtime "state" fails; if the file never appears, set Finished time and default ExitCode to 255. Added tests covering both exit-file-found and timeout paths.

Changes

Cohort / File(s) Summary
Runtime exit-file logic
internal/oci/runtime_oci.go
Added package-private waitForExitFile(c *Container) error (exponential backoff polling). Refactored UpdateContainerStatus to wait for the exit file after runtime "state" failures, set Finished/ExitCode=255 on timeout, and adjusted logging and lock sequencing.
Status update tests
internal/oci/runtime_oci_test.go
Added two Ginkgo tests for UpdateContainerStatus: one where an exit file appears during polling (verifies ExitCode from file) and one where no file appears (verifies fallback ExitCode 255). Added imports for test setup.

Sequence Diagram

sequenceDiagram
    participant Updater as UpdateContainerStatus
    participant OCI as OCI Runtime
    participant Lock as c.opLock
    participant FS as Exit File

    Updater->>OCI: runtimeCmd("state", ...)
    OCI--xUpdater: returns error
    Updater->>Updater: set status = ContainerStateStopped
    Updater->>Lock: unlock c.opLock
    Updater->>FS: waitForExitFile() (exponential backoff polling)
    alt exit file appears
        FS-->>Updater: file found
        Updater->>FS: read exit file
        FS-->>Updater: exit code (e.g., 0)
        Updater->>Updater: set Finished & ExitCode from file
    else timeout
        FS--xUpdater: file not found after retries
        Updater->>Updater: set Finished=time.Now(), ExitCode=255
    end
    Updater->>Lock: re-acquire c.opLock
    Updater->>OCI: retry runtimeCmd("state", ...)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mrunalp
  • klihub
  • hasan4791

Poem

🐰
I nibbled at the timeout, paws tapping like a clock,
I checked the door with gentle hops until the file did knock.
If none arrived I marked the end and penciled two-five-five,
But when it showed a zero—hop!—the rabbit's song came alive.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'OCPBUGS-81547: oci: wait for exit file before defaulting to exit code 255' clearly and specifically summarizes the main change: adding a wait mechanism for the exit file before using a default exit code.
Linked Issues check ✅ Passed The pull request implements all coding requirements from #9840: extracts a waitForExitFile() helper with exponential backoff, applies it to both runtime state failure and stopped container paths, and adds tests covering both successful and failed exit-file scenarios.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the race condition in #9840: modified exit-file waiting logic, added supporting helper function, and comprehensive test coverage for the fix scenarios.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from hasan4791 and klihub March 27, 2026 17:48
@bitoku bitoku force-pushed the fix/9840 branch 2 times, most recently from 90ac4ba to 6ddedbd Compare March 27, 2026 17:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
internal/oci/runtime_oci.go (1)

1217-1231: Consider releasing opLock before waiting for exit file in the error path.

In this error path, waitForExitFile can block for up to ~5 seconds while holding c.opLock. This differs from the pattern at lines 1260-1262 where the lock is explicitly released before waiting.

Holding the lock during the wait could:

  1. Block concurrent ContainerStatus CRI calls (context snippet 2)
  2. Block ContainerStateToDisk operations (context snippet 4)
  3. Delay pod status checks in Kubernetes

However, since the container state is already being set to ContainerStateStopped (line 1216) before the wait, releasing the lock could expose partially updated state. The current approach prioritizes consistency over concurrency.

💡 Optional: Release lock before waiting (mirrors lines 1260-1262 pattern)

If concurrency is preferred over atomicity here, consider:

 			c.state.Status = ContainerStateStopped
 
-			// The exit file may not exist yet if conmon hasn't written it.
-			// Wait for it to appear before defaulting to exit code 255.
-			// This prevents a race where fast-exiting containers
-			// get a spurious exit code 255.
-			waitErr := waitForExitFile(c)
+			c.opLock.Unlock()
+			waitErr := waitForExitFile(c)
+			c.opLock.Lock()
+
 			if waitErr != nil {

This would require careful review of the state consistency implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci.go` around lines 1217 - 1231, The waitForExitFile
call holds c.opLock for up to ~5s causing unwanted blocking; change the
error-path to release the lock before waiting and re-acquire it before mutating
container state—i.e., unlock c.opLock (matching the unlock pattern used around
lines 1260–1262), call waitForExitFile(c), then lock c.opLock again and only
then set c.state.Finished and c.state.ExitCode or call
updateContainerStatusFromExitFile(c); ensure you re-acquire the same lock used
elsewhere so state updates remain synchronized and no other goroutine can
concurrently mutate c.state.
internal/oci/runtime_oci_test.go (2)

263-263: Defer cleanup before mock setup to ensure cleanup on test failure.

The defer cmdrunner.ResetPrependedCmd() statements (lines 263 and 323) are placed after cmdrunner.SetMocked(runner). If any setup between SetMocked and defer fails, the mock won't be reset. Consider moving the defer immediately after setting the mock.

💡 Move defer immediately after SetMocked
 		runner := runnerMock.NewMockCommandRunner(mockCtrl)
 		cmdrunner.SetMocked(runner)
-
 		defer cmdrunner.ResetPrependedCmd()
+
 		runner.EXPECT().Command(gomock.Any(), gomock.Any()).Return(

Also applies to: 323-323

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci_test.go` at line 263, Move each defer
cmdrunner.ResetPrependedCmd() to immediately follow the corresponding
cmdrunner.SetMocked(runner) call so the mock is always reset even if subsequent
setup fails; locate the two occurrences where SetMocked(runner) is called in
internal/oci/runtime_oci_test.go and place defer cmdrunner.ResetPrependedCmd()
directly after each SetMocked invocation (before any further setup or
assertions) to ensure cleanup on test failure.

238-298: Good test coverage for the fix. Minor suggestions for robustness.

The test correctly validates that UpdateContainerStatus waits for the exit file and reads exit code 0.

Two minor suggestions:

  1. Use filepath.Join for path construction (line 274): Consistent with how exitFilePath() constructs the path in production code.

  2. Potential test flakiness: The 800ms delay relies on the backoff timing. If the first retry happens faster than expected (e.g., due to scheduling), the test could still pass; if slower, it could flake. Consider using a channel or file watcher to signal readiness instead of a fixed sleep.

💡 Use filepath.Join for consistency
+		exitFilePath := filepath.Join(containerDir, "exit")
+
 		// Write the exit file after a short delay, simulating conmon
 		// writing it after the first read attempt fails.
 		go func() {
 			time.Sleep(800 * time.Millisecond)
 			Expect(os.WriteFile(
-				containerDir+"/exit", []byte("0"), 0o644,
+				exitFilePath, []byte("0"), 0o644,
 			)).To(Succeed())
 		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci_test.go` around lines 238 - 298, Replace the
hard-coded path construction containerDir+"/exit" with
filepath.Join(containerDir, "exit") to match exitFilePath() and avoid OS path
issues, and remove the fixed time.Sleep delay: instead use a signaling channel
to deterministically control when the test goroutine writes the exit file
(create a chan struct{}, have the goroutine block on receiving from it before
calling os.WriteFile(filepath.Join(containerDir, "exit"), ...), and signal the
channel from the test at the appropriate point around the call to
runtime.UpdateContainerStatus); refer to variables and functions containerDir,
os.WriteFile, filepath.Join, runtime.UpdateContainerStatus and sut.State() to
locate where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/oci/runtime_oci_test.go`:
- Line 263: Move each defer cmdrunner.ResetPrependedCmd() to immediately follow
the corresponding cmdrunner.SetMocked(runner) call so the mock is always reset
even if subsequent setup fails; locate the two occurrences where
SetMocked(runner) is called in internal/oci/runtime_oci_test.go and place defer
cmdrunner.ResetPrependedCmd() directly after each SetMocked invocation (before
any further setup or assertions) to ensure cleanup on test failure.
- Around line 238-298: Replace the hard-coded path construction
containerDir+"/exit" with filepath.Join(containerDir, "exit") to match
exitFilePath() and avoid OS path issues, and remove the fixed time.Sleep delay:
instead use a signaling channel to deterministically control when the test
goroutine writes the exit file (create a chan struct{}, have the goroutine block
on receiving from it before calling os.WriteFile(filepath.Join(containerDir,
"exit"), ...), and signal the channel from the test at the appropriate point
around the call to runtime.UpdateContainerStatus); refer to variables and
functions containerDir, os.WriteFile, filepath.Join,
runtime.UpdateContainerStatus and sut.State() to locate where to change the
test.

In `@internal/oci/runtime_oci.go`:
- Around line 1217-1231: The waitForExitFile call holds c.opLock for up to ~5s
causing unwanted blocking; change the error-path to release the lock before
waiting and re-acquire it before mutating container state—i.e., unlock c.opLock
(matching the unlock pattern used around lines 1260–1262), call
waitForExitFile(c), then lock c.opLock again and only then set c.state.Finished
and c.state.ExitCode or call updateContainerStatusFromExitFile(c); ensure you
re-acquire the same lock used elsewhere so state updates remain synchronized and
no other goroutine can concurrently mutate c.state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5dd45d10-a907-48d5-992a-fed66c211ef9

📥 Commits

Reviewing files that changed from the base of the PR and between 5753c52 and 6ddedbd.

📒 Files selected for processing (2)
  • internal/oci/runtime_oci.go
  • internal/oci/runtime_oci_test.go

@codecov

codecov Bot commented Mar 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.77%. Comparing base (848f93a) to head (88e8f42).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9846      +/-   ##
==========================================
+ Coverage   67.61%   67.77%   +0.15%     
==========================================
  Files         212      212              
  Lines       29366    29386      +20     
==========================================
+ Hits        19856    19915      +59     
+ Misses       7814     7771      -43     
- Partials     1696     1700       +4     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bitoku bitoku changed the title oci: wait for exit file before defaulting to exit code 255 OCPBUGS-81547: oci: wait for exit file before defaulting to exit code 255 Apr 7, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 7, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.

Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:

  • Path A: when runtimeCmd("state") fails and the exit file is missing
  • Path B: when runtimeCmd("state") reports the container as stopped

This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.

Which issue(s) this PR fixes:

Fixes #9840

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed the race condition where cri-o reports exitCode 255 when the container exits fast.

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced container exit status detection reliability

  • Exit codes now properly default to 255 when status verification fails

  • Tests

  • Added test coverage for exit code derivation scenarios

  • Tests verify both successful and failed status verification cases

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread internal/oci/runtime_oci.go Outdated
// waitForExitFile waits for the container's exit file to appear using
// exponential backoff. Returns nil if the file appeared, or an error if it did
// not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps).
func waitForExitFile(c *Container) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: define below called

Comment thread internal/oci/runtime_oci.go Outdated
// get a spurious exit code 255.
waitErr := waitForExitFile(c)
if waitErr != nil {
log.Errorf(ctx, "Failed to update container status from exit file for %s: %v", c.ID(), err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to use waitErr here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Comment thread internal/oci/runtime_oci_test.go Outdated

// Write the exit file after a short delay, simulating conmon
// writing it after the first read attempt fails.
go func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not spawn this closer to UpdateContainerStatus? we probably won't go through 800 ms before it's started but if we did that could be a false positive. unlikely, but just in case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds better. tbh I didn't think about that case.

@bitoku bitoku force-pushed the fix/9840 branch 2 times, most recently from 54244a7 to b7159be Compare April 8, 2026 08:33
When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.

Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:
- Path A: when runtimeCmd("state") fails and the exit file is missing
- Path B: when runtimeCmd("state") reports the container as stopped

This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Ayato Tokubi <atokubi@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/oci/runtime_oci.go (1)

1166-1186: Consider accepting context.Context for cancellation support.

The function can block for up to ~5 seconds without being cancellable. If the upstream context is cancelled (e.g., during shutdown), this function will continue polling. While the existing file operations in this codebase follow similar patterns, adding context support would improve responsiveness.

As per coding guidelines: "Propagate context.Context through function calls in Go code."

♻️ Optional: Add context support using `kwait.ExponentialBackoffWithContext`
-// waitForExitFile waits for the container's exit file to appear using
-// exponential backoff. Returns nil if the file appeared, or an error if it did
-// not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps).
-func waitForExitFile(c *Container) error {
+// waitForExitFile waits for the container's exit file to appear using
+// exponential backoff. Returns nil if the file appeared, or an error if it did
+// not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps).
+func waitForExitFile(ctx context.Context, c *Container) error {
 	exitFilePath := c.exitFilePath()
 
-	return kwait.ExponentialBackoff(
+	return kwait.ExponentialBackoffWithContext(
+		ctx,
 		kwait.Backoff{
 			Duration: 500 * time.Millisecond,
 			Factor:   1.2,
 			Steps:    6,
 		},
 		func() (bool, error) {

Note: This would require updating call sites to pass context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci.go` around lines 1166 - 1186, Update waitForExitFile
to accept a context.Context and use kwait.ExponentialBackoffWithContext so the
polling can be cancelled; specifically change the signature of waitForExitFile(c
*Container) error to waitForExitFile(ctx context.Context, c *Container) error,
replace kwait.ExponentialBackoff(...) with
kwait.ExponentialBackoffWithContext(ctx, ...), and propagate the context through
the lambda that calls os.Stat(exitFilePath). Finally, update all call sites of
waitForExitFile to pass the appropriate context (e.g., the upstream context or
context.Background() where cancellation is not available).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/oci/runtime_oci.go`:
- Around line 1166-1186: Update waitForExitFile to accept a context.Context and
use kwait.ExponentialBackoffWithContext so the polling can be cancelled;
specifically change the signature of waitForExitFile(c *Container) error to
waitForExitFile(ctx context.Context, c *Container) error, replace
kwait.ExponentialBackoff(...) with kwait.ExponentialBackoffWithContext(ctx,
...), and propagate the context through the lambda that calls
os.Stat(exitFilePath). Finally, update all call sites of waitForExitFile to pass
the appropriate context (e.g., the upstream context or context.Background()
where cancellation is not available).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19dbc3da-fc99-4859-855d-c426e9dfa480

📥 Commits

Reviewing files that changed from the base of the PR and between 6ddedbd and 54244a7.

📒 Files selected for processing (2)
  • internal/oci/runtime_oci.go
  • internal/oci/runtime_oci_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/oci/runtime_oci_test.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/oci/runtime_oci.go (1)

1283-1303: Consider accepting context.Context for cancellation support.

The function doesn't currently support cancellation. While kwait.ExponentialBackoff doesn't natively support context, you could use kwait.ExponentialBackoffWithContext or check ctx.Done() within the condition function to allow callers to abort the wait early. This would align with the coding guideline to propagate context through function calls.

That said, given the bounded ~5s maximum wait and the primary goal of fixing the race condition, this is a nice-to-have improvement rather than a blocker.

🔧 Optional: Add context support
-func waitForExitFile(c *Container) error {
+func waitForExitFile(ctx context.Context, c *Container) error {
 	exitFilePath := c.exitFilePath()
 
 	return kwait.ExponentialBackoff(
 		kwait.Backoff{
 			Duration: 500 * time.Millisecond,
 			Factor:   1.2,
 			Steps:    6,
 		},
 		func() (bool, error) {
+			select {
+			case <-ctx.Done():
+				return false, ctx.Err()
+			default:
+			}
 			_, err := os.Stat(exitFilePath)
 			if err != nil {
 				return false, nil
 			}
 
 			return true, nil
 		})
 }

As per coding guidelines: "Propagate context.Context through function calls in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci.go` around lines 1283 - 1303, The waitForExitFile
function lacks context cancellation support; update its signature to accept ctx
context.Context and use kwait.ExponentialBackoffWithContext (or pass a condition
that checks ctx.Done()) so callers can cancel the wait early: change func
waitForExitFile(c *Container) error to func waitForExitFile(ctx context.Context,
c *Container) error, pass ctx into the backoff call (or return early when ctx is
done inside the closure), and ensure all callers are updated to supply a
context.
internal/oci/runtime_oci_test.go (1)

281-286: Use filepath.Join for constructing the exit file path.

While string concatenation works on Linux, filepath.Join is the idiomatic approach for path construction in Go.

🔧 Suggested change
+		exitFile := filepath.Join(containerDir, "exit")
 		go func() {
 			time.Sleep(800 * time.Millisecond)
 			Expect(os.WriteFile(
-				containerDir+"/exit", []byte("0"), 0o644,
+				exitFile, []byte("0"), 0o644,
 			)).To(Succeed())
 		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/runtime_oci_test.go` around lines 281 - 286, Replace the
string-concatenated path containerDir+"/exit" with filepath.Join(containerDir,
"exit") in the goroutine that writes the exit file (the anonymous goroutine in
runtime_oci_test.go), and add an import for "path/filepath" if it's not already
imported so path construction is platform-idiomatic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/oci/runtime_oci_test.go`:
- Around line 281-286: Replace the string-concatenated path containerDir+"/exit"
with filepath.Join(containerDir, "exit") in the goroutine that writes the exit
file (the anonymous goroutine in runtime_oci_test.go), and add an import for
"path/filepath" if it's not already imported so path construction is
platform-idiomatic.

In `@internal/oci/runtime_oci.go`:
- Around line 1283-1303: The waitForExitFile function lacks context cancellation
support; update its signature to accept ctx context.Context and use
kwait.ExponentialBackoffWithContext (or pass a condition that checks ctx.Done())
so callers can cancel the wait early: change func waitForExitFile(c *Container)
error to func waitForExitFile(ctx context.Context, c *Container) error, pass ctx
into the backoff call (or return early when ctx is done inside the closure), and
ensure all callers are updated to supply a context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3cc65530-8f6a-4ff2-9ee2-020499ddf66e

📥 Commits

Reviewing files that changed from the base of the PR and between 54244a7 and 88e8f42.

📒 Files selected for processing (2)
  • internal/oci/runtime_oci.go
  • internal/oci/runtime_oci_test.go

@bitoku

bitoku commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown

@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is invalid:

  • expected the bug to target only the "4.22.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku

bitoku commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 8, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku

bitoku commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

/cherry-pick release-1.35 release-1.34 release-1.33

@openshift-cherrypick-robot

Copy link
Copy Markdown

@bitoku: once the present PR merges, I will cherry-pick it on top of release-1.35 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-1.35 release-1.34 release-1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bitoku

bitoku commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

@cri-o/cri-o-maintainers PTAL

@haircommander

Copy link
Copy Markdown
Member

/approve
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2026
@openshift-ci

openshift-ci Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 2ebd656 into cri-o:main Apr 8, 2026
71 of 74 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@bitoku: Jira Issue OCPBUGS-81547: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-81547 has been moved to the MODIFIED state.

Details

In response to this:

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.

Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:

  • Path A: when runtimeCmd("state") fails and the exit file is missing
  • Path B: when runtimeCmd("state") reports the container as stopped

This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.

Which issue(s) this PR fixes:

Fixes #9840

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed the race condition where cri-o reports exitCode 255 when the container exits fast.

Summary by CodeRabbit

  • Bug Fixes

  • Improved reliability by waiting for container exit markers when runtime state queries fail, and clarifying related logging.

  • Ensures exit time and exit code fallback to 255 when exit status cannot be determined after retries.

  • Tests

  • Added tests validating successful detection from delayed exit markers and fallback-to-255 when none appear.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

Copy link
Copy Markdown

@bitoku: new pull request created: #9871

Details

In response to this:

/cherry-pick release-1.35 release-1.34 release-1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CRI-O reports exit code 255 for fast-exiting init containers in production (conmon race condition)

4 participants