OCPBUGS-81547: oci: wait for exit file before defaulting to exit code 255#9846
Conversation
📝 WalkthroughWalkthroughWait for container exit file using exponential backoff when OCI runtime "state" fails; if the file never appears, set Finished time and default ExitCode to 255. Added tests covering both exit-file-found and timeout paths. Changes
Sequence DiagramsequenceDiagram
participant Updater as UpdateContainerStatus
participant OCI as OCI Runtime
participant Lock as c.opLock
participant FS as Exit File
Updater->>OCI: runtimeCmd("state", ...)
OCI--xUpdater: returns error
Updater->>Updater: set status = ContainerStateStopped
Updater->>Lock: unlock c.opLock
Updater->>FS: waitForExitFile() (exponential backoff polling)
alt exit file appears
FS-->>Updater: file found
Updater->>FS: read exit file
FS-->>Updater: exit code (e.g., 0)
Updater->>Updater: set Finished & ExitCode from file
else timeout
FS--xUpdater: file not found after retries
Updater->>Updater: set Finished=time.Now(), ExitCode=255
end
Updater->>Lock: re-acquire c.opLock
Updater->>OCI: retry runtimeCmd("state", ...)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
90ac4ba to
6ddedbd
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
internal/oci/runtime_oci.go (1)
1217-1231: Consider releasing opLock before waiting for exit file in the error path.In this error path,
waitForExitFilecan block for up to ~5 seconds while holdingc.opLock. This differs from the pattern at lines 1260-1262 where the lock is explicitly released before waiting.Holding the lock during the wait could:
- Block concurrent
ContainerStatusCRI calls (context snippet 2)- Block
ContainerStateToDiskoperations (context snippet 4)- Delay pod status checks in Kubernetes
However, since the container state is already being set to
ContainerStateStopped(line 1216) before the wait, releasing the lock could expose partially updated state. The current approach prioritizes consistency over concurrency.💡 Optional: Release lock before waiting (mirrors lines 1260-1262 pattern)
If concurrency is preferred over atomicity here, consider:
c.state.Status = ContainerStateStopped - // The exit file may not exist yet if conmon hasn't written it. - // Wait for it to appear before defaulting to exit code 255. - // This prevents a race where fast-exiting containers - // get a spurious exit code 255. - waitErr := waitForExitFile(c) + c.opLock.Unlock() + waitErr := waitForExitFile(c) + c.opLock.Lock() + if waitErr != nil {This would require careful review of the state consistency implications.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci.go` around lines 1217 - 1231, The waitForExitFile call holds c.opLock for up to ~5s causing unwanted blocking; change the error-path to release the lock before waiting and re-acquire it before mutating container state—i.e., unlock c.opLock (matching the unlock pattern used around lines 1260–1262), call waitForExitFile(c), then lock c.opLock again and only then set c.state.Finished and c.state.ExitCode or call updateContainerStatusFromExitFile(c); ensure you re-acquire the same lock used elsewhere so state updates remain synchronized and no other goroutine can concurrently mutate c.state.internal/oci/runtime_oci_test.go (2)
263-263: Defer cleanup before mock setup to ensure cleanup on test failure.The
defer cmdrunner.ResetPrependedCmd()statements (lines 263 and 323) are placed aftercmdrunner.SetMocked(runner). If any setup betweenSetMockedanddeferfails, the mock won't be reset. Consider moving the defer immediately after setting the mock.💡 Move defer immediately after SetMocked
runner := runnerMock.NewMockCommandRunner(mockCtrl) cmdrunner.SetMocked(runner) - defer cmdrunner.ResetPrependedCmd() + runner.EXPECT().Command(gomock.Any(), gomock.Any()).Return(Also applies to: 323-323
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci_test.go` at line 263, Move each defer cmdrunner.ResetPrependedCmd() to immediately follow the corresponding cmdrunner.SetMocked(runner) call so the mock is always reset even if subsequent setup fails; locate the two occurrences where SetMocked(runner) is called in internal/oci/runtime_oci_test.go and place defer cmdrunner.ResetPrependedCmd() directly after each SetMocked invocation (before any further setup or assertions) to ensure cleanup on test failure.
238-298: Good test coverage for the fix. Minor suggestions for robustness.The test correctly validates that
UpdateContainerStatuswaits for the exit file and reads exit code 0.Two minor suggestions:
Use
filepath.Joinfor path construction (line 274): Consistent with howexitFilePath()constructs the path in production code.Potential test flakiness: The 800ms delay relies on the backoff timing. If the first retry happens faster than expected (e.g., due to scheduling), the test could still pass; if slower, it could flake. Consider using a channel or file watcher to signal readiness instead of a fixed sleep.
💡 Use filepath.Join for consistency
+ exitFilePath := filepath.Join(containerDir, "exit") + // Write the exit file after a short delay, simulating conmon // writing it after the first read attempt fails. go func() { time.Sleep(800 * time.Millisecond) Expect(os.WriteFile( - containerDir+"/exit", []byte("0"), 0o644, + exitFilePath, []byte("0"), 0o644, )).To(Succeed()) }()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci_test.go` around lines 238 - 298, Replace the hard-coded path construction containerDir+"/exit" with filepath.Join(containerDir, "exit") to match exitFilePath() and avoid OS path issues, and remove the fixed time.Sleep delay: instead use a signaling channel to deterministically control when the test goroutine writes the exit file (create a chan struct{}, have the goroutine block on receiving from it before calling os.WriteFile(filepath.Join(containerDir, "exit"), ...), and signal the channel from the test at the appropriate point around the call to runtime.UpdateContainerStatus); refer to variables and functions containerDir, os.WriteFile, filepath.Join, runtime.UpdateContainerStatus and sut.State() to locate where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/oci/runtime_oci_test.go`:
- Line 263: Move each defer cmdrunner.ResetPrependedCmd() to immediately follow
the corresponding cmdrunner.SetMocked(runner) call so the mock is always reset
even if subsequent setup fails; locate the two occurrences where
SetMocked(runner) is called in internal/oci/runtime_oci_test.go and place defer
cmdrunner.ResetPrependedCmd() directly after each SetMocked invocation (before
any further setup or assertions) to ensure cleanup on test failure.
- Around line 238-298: Replace the hard-coded path construction
containerDir+"/exit" with filepath.Join(containerDir, "exit") to match
exitFilePath() and avoid OS path issues, and remove the fixed time.Sleep delay:
instead use a signaling channel to deterministically control when the test
goroutine writes the exit file (create a chan struct{}, have the goroutine block
on receiving from it before calling os.WriteFile(filepath.Join(containerDir,
"exit"), ...), and signal the channel from the test at the appropriate point
around the call to runtime.UpdateContainerStatus); refer to variables and
functions containerDir, os.WriteFile, filepath.Join,
runtime.UpdateContainerStatus and sut.State() to locate where to change the
test.
In `@internal/oci/runtime_oci.go`:
- Around line 1217-1231: The waitForExitFile call holds c.opLock for up to ~5s
causing unwanted blocking; change the error-path to release the lock before
waiting and re-acquire it before mutating container state—i.e., unlock c.opLock
(matching the unlock pattern used around lines 1260–1262), call
waitForExitFile(c), then lock c.opLock again and only then set c.state.Finished
and c.state.ExitCode or call updateContainerStatusFromExitFile(c); ensure you
re-acquire the same lock used elsewhere so state updates remain synchronized and
no other goroutine can concurrently mutate c.state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5dd45d10-a907-48d5-992a-fed66c211ef9
📒 Files selected for processing (2)
internal/oci/runtime_oci.gointernal/oci/runtime_oci_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #9846 +/- ##
==========================================
+ Coverage 67.61% 67.77% +0.15%
==========================================
Files 212 212
Lines 29366 29386 +20
==========================================
+ Hits 19856 19915 +59
+ Misses 7814 7771 -43
- Partials 1696 1700 +4 🚀 New features to boost your workflow:
|
|
@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| // waitForExitFile waits for the container's exit file to appear using | ||
| // exponential backoff. Returns nil if the file appeared, or an error if it did | ||
| // not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps). | ||
| func waitForExitFile(c *Container) error { |
| // get a spurious exit code 255. | ||
| waitErr := waitForExitFile(c) | ||
| if waitErr != nil { | ||
| log.Errorf(ctx, "Failed to update container status from exit file for %s: %v", c.ID(), err) |
There was a problem hiding this comment.
did you mean to use waitErr here?
|
|
||
| // Write the exit file after a short delay, simulating conmon | ||
| // writing it after the first read attempt fails. | ||
| go func() { |
There was a problem hiding this comment.
why not spawn this closer to UpdateContainerStatus? we probably won't go through 800 ms before it's started but if we did that could be a false positive. unlikely, but just in case?
There was a problem hiding this comment.
that sounds better. tbh I didn't think about that case.
54244a7 to
b7159be
Compare
When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.
Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:
- Path A: when runtimeCmd("state") fails and the exit file is missing
- Path B: when runtimeCmd("state") reports the container as stopped
This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/oci/runtime_oci.go (1)
1166-1186: Consider acceptingcontext.Contextfor cancellation support.The function can block for up to ~5 seconds without being cancellable. If the upstream context is cancelled (e.g., during shutdown), this function will continue polling. While the existing file operations in this codebase follow similar patterns, adding context support would improve responsiveness.
As per coding guidelines: "Propagate context.Context through function calls in Go code."
♻️ Optional: Add context support using `kwait.ExponentialBackoffWithContext`
-// waitForExitFile waits for the container's exit file to appear using -// exponential backoff. Returns nil if the file appeared, or an error if it did -// not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps). -func waitForExitFile(c *Container) error { +// waitForExitFile waits for the container's exit file to appear using +// exponential backoff. Returns nil if the file appeared, or an error if it did +// not appear within the timeout (~5s total: 500ms initial, 1.2 factor, 6 steps). +func waitForExitFile(ctx context.Context, c *Container) error { exitFilePath := c.exitFilePath() - return kwait.ExponentialBackoff( + return kwait.ExponentialBackoffWithContext( + ctx, kwait.Backoff{ Duration: 500 * time.Millisecond, Factor: 1.2, Steps: 6, }, func() (bool, error) {Note: This would require updating call sites to pass context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci.go` around lines 1166 - 1186, Update waitForExitFile to accept a context.Context and use kwait.ExponentialBackoffWithContext so the polling can be cancelled; specifically change the signature of waitForExitFile(c *Container) error to waitForExitFile(ctx context.Context, c *Container) error, replace kwait.ExponentialBackoff(...) with kwait.ExponentialBackoffWithContext(ctx, ...), and propagate the context through the lambda that calls os.Stat(exitFilePath). Finally, update all call sites of waitForExitFile to pass the appropriate context (e.g., the upstream context or context.Background() where cancellation is not available).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/oci/runtime_oci.go`:
- Around line 1166-1186: Update waitForExitFile to accept a context.Context and
use kwait.ExponentialBackoffWithContext so the polling can be cancelled;
specifically change the signature of waitForExitFile(c *Container) error to
waitForExitFile(ctx context.Context, c *Container) error, replace
kwait.ExponentialBackoff(...) with kwait.ExponentialBackoffWithContext(ctx,
...), and propagate the context through the lambda that calls
os.Stat(exitFilePath). Finally, update all call sites of waitForExitFile to pass
the appropriate context (e.g., the upstream context or context.Background()
where cancellation is not available).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 19dbc3da-fc99-4859-855d-c426e9dfa480
📒 Files selected for processing (2)
internal/oci/runtime_oci.gointernal/oci/runtime_oci_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/oci/runtime_oci_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/oci/runtime_oci.go (1)
1283-1303: Consider acceptingcontext.Contextfor cancellation support.The function doesn't currently support cancellation. While
kwait.ExponentialBackoffdoesn't natively support context, you could usekwait.ExponentialBackoffWithContextor checkctx.Done()within the condition function to allow callers to abort the wait early. This would align with the coding guideline to propagate context through function calls.That said, given the bounded ~5s maximum wait and the primary goal of fixing the race condition, this is a nice-to-have improvement rather than a blocker.
🔧 Optional: Add context support
-func waitForExitFile(c *Container) error { +func waitForExitFile(ctx context.Context, c *Container) error { exitFilePath := c.exitFilePath() return kwait.ExponentialBackoff( kwait.Backoff{ Duration: 500 * time.Millisecond, Factor: 1.2, Steps: 6, }, func() (bool, error) { + select { + case <-ctx.Done(): + return false, ctx.Err() + default: + } _, err := os.Stat(exitFilePath) if err != nil { return false, nil } return true, nil }) }As per coding guidelines: "Propagate context.Context through function calls in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci.go` around lines 1283 - 1303, The waitForExitFile function lacks context cancellation support; update its signature to accept ctx context.Context and use kwait.ExponentialBackoffWithContext (or pass a condition that checks ctx.Done()) so callers can cancel the wait early: change func waitForExitFile(c *Container) error to func waitForExitFile(ctx context.Context, c *Container) error, pass ctx into the backoff call (or return early when ctx is done inside the closure), and ensure all callers are updated to supply a context.internal/oci/runtime_oci_test.go (1)
281-286: Usefilepath.Joinfor constructing the exit file path.While string concatenation works on Linux,
filepath.Joinis the idiomatic approach for path construction in Go.🔧 Suggested change
+ exitFile := filepath.Join(containerDir, "exit") go func() { time.Sleep(800 * time.Millisecond) Expect(os.WriteFile( - containerDir+"/exit", []byte("0"), 0o644, + exitFile, []byte("0"), 0o644, )).To(Succeed()) }()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/runtime_oci_test.go` around lines 281 - 286, Replace the string-concatenated path containerDir+"/exit" with filepath.Join(containerDir, "exit") in the goroutine that writes the exit file (the anonymous goroutine in runtime_oci_test.go), and add an import for "path/filepath" if it's not already imported so path construction is platform-idiomatic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/oci/runtime_oci_test.go`:
- Around line 281-286: Replace the string-concatenated path containerDir+"/exit"
with filepath.Join(containerDir, "exit") in the goroutine that writes the exit
file (the anonymous goroutine in runtime_oci_test.go), and add an import for
"path/filepath" if it's not already imported so path construction is
platform-idiomatic.
In `@internal/oci/runtime_oci.go`:
- Around line 1283-1303: The waitForExitFile function lacks context cancellation
support; update its signature to accept ctx context.Context and use
kwait.ExponentialBackoffWithContext (or pass a condition that checks ctx.Done())
so callers can cancel the wait early: change func waitForExitFile(c *Container)
error to func waitForExitFile(ctx context.Context, c *Container) error, pass ctx
into the backoff call (or return early when ctx is done inside the closure), and
ensure all callers are updated to supply a context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3cc65530-8f6a-4ff2-9ee2-020499ddf66e
📒 Files selected for processing (2)
internal/oci/runtime_oci.gointernal/oci/runtime_oci_test.go
|
/jira refresh |
|
@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bitoku: This pull request references Jira Issue OCPBUGS-81547, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-1.35 release-1.34 release-1.33 |
|
@bitoku: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@cri-o/cri-o-maintainers PTAL |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bitoku, haircommander The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@bitoku: Jira Issue OCPBUGS-81547: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-81547 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bitoku: new pull request created: #9871 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a container exits very quickly, the OCI runtime may clean
up its state before CRI-O queries it. In this case, runtimeCmd("state")
fails and CRI-O falls back to reading the exit file. However, conmon
may not have written the exit file yet, causing CRI-O to immediately
default to exit code 255 with no retry.
Extract the existing exponential backoff wait for the exit file into a
shared waitForExitFile() function and use it in both code paths:
This ensures fast-exiting containers get the correct exit code from
conmon's exit file instead of a spurious 255.
Which issue(s) this PR fixes:
Fixes #9840
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Summary by CodeRabbit
Bug Fixes
Tests