OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer by sabujmaity · Pull Request #9799 · cri-o/cri-o

sabujmaity · 2026-03-05T09:04:55Z

/kind bug

What this PR does / why we need it:

There is a race condition in the container stop path. When a second StopContainer call
comes in after the first one has already finished (i.e. SetAsDoneStopping has
run), WaitOnStopTimeout still gets called. At that point stopTimeoutChan
is already closed, so we panic.

The sequence is:

1. First StopContainer -> starts the stop loop, waits via WaitOnStopTimeout
2. Stop loop finishes -> SetAsDoneStopping closes stopTimeoutChan and watchers
3. Second StopContainer -> SetAsStopping returns false (already stopping),
   falls through to WaitOnStopTimeout which tries to use the closed channel

This adds a stopDone bool (guarded by the existing stopLock) that gets set
in SetAsDoneStopping. WaitOnStopTimeout checks it and returns early if
the stop lifecycle is already done.
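
To make the guard concrete, here is a minimal Go sketch of the idea. It is not the cri-o implementation: the `container` type, its lowercase methods, the `int64` timeout, and the buffered channel are simplified assumptions for illustration, and the sketch assumes the unsafe operation is a send on the closed channel.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical, stripped-down stand-in for the real type.
type container struct {
	stopLock        sync.Mutex
	stopping        bool
	stopDone        bool
	stopTimeoutChan chan int64
}

func newContainer() *container {
	return &container{stopTimeoutChan: make(chan int64, 10)}
}

// setAsStopping reports whether the caller is the first one to request a stop.
func (c *container) setAsStopping() bool {
	c.stopLock.Lock()
	defer c.stopLock.Unlock()
	if c.stopping {
		return false
	}
	c.stopping = true
	return true
}

// setAsDoneStopping records that the stop lifecycle finished, then closes the channel.
func (c *container) setAsDoneStopping() {
	c.stopLock.Lock()
	defer c.stopLock.Unlock()
	c.stopDone = true
	close(c.stopTimeoutChan)
}

// waitOnStopTimeout forwards a timeout to the stop loop and reports whether it
// was sent. The stopDone check keeps it away from the closed channel.
func (c *container) waitOnStopTimeout(timeout int64) bool {
	c.stopLock.Lock()
	defer c.stopLock.Unlock()
	if !c.stopping || c.stopDone {
		return false
	}
	select {
	case c.stopTimeoutChan <- timeout:
		return true
	default: // buffer full; do not block while holding the lock
		return false
	}
}

func main() {
	c := newContainer()
	fmt.Println(c.setAsStopping())       // first StopContainer
	fmt.Println(c.waitOnStopTimeout(10)) // stop still in progress
	c.setAsDoneStopping()                // stop loop finishes
	fmt.Println(c.setAsStopping())       // second StopContainer
	fmt.Println(c.waitOnStopTimeout(10)) // returns early instead of panicking
}
```

Without the `c.stopDone` check, the last call in `main` would reach the send on the closed channel and panic.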

Which issue(s) this PR fixes:

Addresses: OCPBUGS-76451

Special notes for your reviewer:

The change is small: one new bool field, one assignment, one extra condition
check. All under the existing stopLock so no new synchronization concerns.

Added a unit test that exercises the exact crash sequence:
SetAsStopping -> SetAsDoneStopping -> WaitOnStopTimeout and confirms it
doesn't panic.

Does this PR introduce a user-facing change?

Fixed a panic when concurrent StopContainer calls race against the stop lifecycle completing.

Summary by CodeRabbit

Bug Fixes
- Prevented rare panics during container shutdown by improving coordination between stop and timeout handling, making shutdown sequences more reliable and resilient.
Tests
- Added regression tests to ensure stop/wait operations do not panic when invoked after shutdown completion, reducing regressions and increasing stability.

openshift-ci · 2026-03-05T09:05:08Z

Hi @sabujmaity. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-03-05T09:05:31Z

Walkthrough

Added a stopDone boolean to Container and adjusted stop coordination so WaitOnStopTimeout returns early if not stopping or stopDone is true. SetAsDoneStopping now sets stopDone = true before notifying/closing watchers. Added a test to ensure no panic when waiting after done-stopping.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Stop/Wait Coordination Logic `internal/oci/container.go` | Added `stopDone bool` to `Container`. `WaitOnStopTimeout` now early-returns when not stopping or `stopDone` is true. `SetAsDoneStopping` sets `stopDone = true` before closing stop watchers and clearing timeout channel. |
| Synchronization Test Coverage `internal/oci/container_test.go` | Added regression test ensuring `WaitOnStopTimeout` does not panic when called after `SetAsDoneStopping`; adjusted related test to validate watcher/channel clearing and subsequent panic behavior on second call. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

mrunalp
littlejawa
QiWang19

Poem

🐰 I set a flag and hopped aside,
The watchers closed; no panic cried.
Waiters call when work is done,
Threads settle under the sun.
🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title directly describes the main fix: preventing a panic on the closed stopTimeoutChan channel in StopContainer, which is the core issue addressed in this PR. |


bitoku

lgtm, let's make it ready for review so that we can run the tests

At first I thought we could put WaitOnStopTimeout inside if c.SetAsStopping(), but it's not that simple, because then the subsequent StopContainer calls couldn't wait.
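
One way to read this remark, sketched in Go under stated assumptions: only the first caller starts the stop work, but every caller still has to block until that work finishes, so the wait cannot live inside the branch that only the first caller takes. The `container` type, `stop`, `done`, and `runConcurrentStops` below are hypothetical names for illustration and do not mirror the cri-o code.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in, not the cri-o type.
type container struct {
	mu       sync.Mutex
	stopping bool
	done     chan struct{} // closed once the stop work has finished
}

func newContainer() *container {
	return &container{done: make(chan struct{})}
}

// setAsStopping reports whether the caller is the first stopper.
func (c *container) setAsStopping() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopping {
		return false
	}
	c.stopping = true
	return true
}

// stop starts the work only for the first caller, but the wait sits outside
// the branch so later callers also block until the work is finished.
func (c *container) stop(work func()) {
	if c.setAsStopping() {
		go func() {
			work()
			close(c.done)
		}()
	}
	<-c.done // receiving from a closed channel returns immediately
}

// runConcurrentStops issues n concurrent stop calls and returns how many
// times the stop work actually ran.
func runConcurrentStops(c *container, n int) int {
	ran := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.stop(func() { ran++ })
		}()
	}
	wg.Wait()
	return ran
}

func main() {
	c := newContainer()
	fmt.Println("stop work ran:", runConcurrentStops(c, 3))
	c.stop(func() {}) // a late call after completion does not block or panic
	fmt.Println("late call returned")
}
```

If the `<-c.done` line were moved inside the `if`, the later callers would return right away while the container was still being stopped.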

bitoku · 2026-03-05T16:29:07Z

+		It("should not panic when WaitOnStopTimeout is called after SetAsDoneStopping", func() {
+
+			ctx := context.Background()
+
+			sut.SetAsStopping()
+
+			// Simulation of the stop loop finishing.
+			sut.SetAsDoneStopping()
+
+			Expect(func() {
+				sut.WaitOnStopTimeout(ctx, 1000)
+			}).ToNot(Panic())
+		})
+


I'd like more explanation here about why we check this; otherwise future developers may have trouble understanding why the test exists.

@sabujmaity

@bitoku I have added the explanation!

How about this

// Regression test for a race between concurrent StopContainer calls.
// When a second StopContainer arrives after the first has already
// completed (SetAsDoneStopping closed stopTimeoutChan),
// WaitOnStopTimeout used to panic on the closed channel.
// The stopDone guard ensures it returns early instead.

Sure thing. This is a better explanation! Thanks for the suggestion.

bitoku · 2026-03-05T23:34:14Z

/ok-to-test

sabujmaity · 2026-03-06T10:07:24Z

/retest-required

codecov · 2026-03-06T13:15:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.64%. Comparing base (7421a8e) to head (d51616f).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9799      +/-   ##
==========================================
+ Coverage   67.45%   67.64%   +0.19%     
==========================================
  Files         210      212       +2     
  Lines       29123    29237     +114     
==========================================
+ Hits        19644    19777     +133     
+ Misses       7790     7774      -16     
+ Partials     1689     1686       -3

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sabujmaity · 2026-03-09T15:31:46Z

/retest

A race condition occurs when a second StopContainer call arrives after the container has already been marked as done stopping. Specifically, SetAsDoneStopping closes the stopTimeoutChan, and subsequent calls attempting to interact with or close this channel result in a "panic: close of closed channel". This patch adds a guard using the stopDone internal state within WaitOnStopTimeout to ensure we return early if the stop lifecycle has already completed, preventing redundant channel operations.

Addresses: OCPBUGS-76451

Signed-off-by: Sabuj Maity <samaity@redhat.com>
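
For context on the panic text quoted in this commit message, the following standalone Go sketch shows how the language treats a closed channel: closing it again panics, sending on it panics, and receiving from it returns immediately. The `panicMessage` helper is a hypothetical name introduced here and is unrelated to the cri-o code.

```go
package main

import "fmt"

// panicMessage runs f and returns the recovered panic text,
// or "" if f returned normally.
func panicMessage(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	ch := make(chan int, 1)
	close(ch)

	// Closing an already closed channel panics.
	fmt.Println(panicMessage(func() { close(ch) }))

	// Sending on a closed channel panics as well.
	fmt.Println(panicMessage(func() { ch <- 1 }))

	// Receiving is safe: it yields the zero value and ok == false.
	v, ok := <-ch
	fmt.Println(v, ok)
}
```

Since neither a send nor a second close is safe once the channel is closed, a flag checked under the same lock is one simple way to avoid touching the channel again.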

coderabbitai

🧹 Nitpick comments (1)

internal/oci/container.go (1)
72-74: Consider adding a comment explaining the purpose of stopDone.

Other stop-related fields like stopTimeoutChan have comments explaining their role. A brief comment here would clarify why this flag exists (to guard against the race condition when WaitOnStopTimeout is called after SetAsDoneStopping has already closed the channel).
💡 Suggested comment
 	stopping           bool
+	// stopDone guards WaitOnStopTimeout from using stopTimeoutChan after
+	// SetAsDoneStopping has closed it, preventing a panic on concurrent
+	// StopContainer calls that race against the stop lifecycle completing.
 	stopDone           bool
 	stopLock           sync.Mutex
As per coding guidelines: "Add comments explaining 'why' not 'what' in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/container.go` around lines 72 - 74, Add a brief Go comment above
the stopDone field explaining why it exists (not what it is): state that
stopDone is a boolean used to indicate the stop completion to avoid a race where
WaitOnStopTimeout may be invoked after SetAsDoneStopping has already closed
stopTimeoutChan; mention it is checked/guarded under stopLock to prevent
double-close/race conditions with stopTimeoutChan and other stop-related logic
(references: stopDone, stopLock, stopTimeoutChan, WaitOnStopTimeout,
SetAsDoneStopping).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b94841fd-8b3b-405e-a64e-e97891214235

📥 Commits

Reviewing files that changed from the base of the PR and between 142482e and d51616f.

📒 Files selected for processing (2)

internal/oci/container.go
internal/oci/container_test.go

bitoku · 2026-03-10T14:44:21Z

/lgtm if the tests pass

bitoku · 2026-03-11T11:42:54Z

/lgtm

bitoku · 2026-03-11T11:43:08Z

@cri-o/cri-o-maintainers PTAL

openshift-ci-robot · 2026-03-11T16:49:57Z

@sabujmaity: This pull request references Jira Issue OCPBUGS-76451, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

bitoku · 2026-03-11T16:50:46Z

/jira refresh

openshift-ci-robot · 2026-03-11T16:50:57Z

@bitoku: This pull request references Jira Issue OCPBUGS-76451, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lyman9966


openshift-ci · 2026-03-11T17:10:40Z

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lyman9966.

Note that only cri-o members and repo collaborators can review this PR, and authors cannot review their own PRs.


openshift-ci · 2026-03-13T10:07:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sabujmaity, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-03-13T10:12:24Z

@sabujmaity: Jira Issue OCPBUGS-76451: All pull requests linked via external trackers have merged:

cri-o/cri-o#9799

Jira Issue OCPBUGS-76451 has been moved to the MODIFIED state.


bitoku · 2026-03-13T10:38:13Z

/cherry-pick release-1.35

openshift-cherrypick-robot · 2026-03-13T10:39:15Z

@bitoku: new pull request created: #9814


Add a stopDone boolean guard to the Container struct. When SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone. WaitOnStopTimeout checks this flag and returns early, preventing a panic from sending on a closed channel when a second StopContainer call arrives after the first has completed.

This is a manual cherry-pick of #9799 to release-1.34, adapted for the branch's test file structure.

Signed-off-by: Sabuj Maity <samaity@redhat.com>

Add a stopDone boolean guard to the Container struct. When SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone. WaitOnStopTimeout checks this flag and returns early, preventing a panic from sending on a closed channel when a second StopContainer call arrives after the first has completed.

This is a manual cherry-pick of cri-o#9799 to release-1.33, adapted for the branch's test file structure.

Signed-off-by: Sabuj Maity <samaity@redhat.com>

openshift-ci Bot added the needs-ok-to-test label ("Indicates a PR that requires an org member to verify it is safe to test.") on Mar 5, 2026

bitoku reviewed Mar 5, 2026

openshift-ci Bot added the release-note label ("Denotes a PR that will be considered when it comes time to generate release notes.") and removed the do-not-merge/release-note-label-needed label ("Indicates that a PR should not merge because it's missing one of the release note labels.") on Mar 5, 2026

openshift-ci Bot added the ok-to-test label ("Indicates a non-member PR verified by an org member that is safe to test.") and removed the needs-ok-to-test label ("Indicates a PR that requires an org member to verify it is safe to test.") on Mar 5, 2026

sabujmaity marked this pull request as ready for review March 6, 2026 06:52

sabujmaity requested a review from mrunalp as a code owner March 6, 2026 06:52

openshift-ci Bot removed the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") on Mar 6, 2026

openshift-ci Bot requested review from QiWang19 and littlejawa March 6, 2026 06:52

sabujmaity force-pushed the fix-ocbugs-76451-panic branch from b764363 to e7f536e Compare March 6, 2026 19:48

sabujmaity force-pushed the fix-ocbugs-76451-panic branch from e7f536e to 142482e Compare March 10, 2026 05:43

sabujmaity force-pushed the fix-ocbugs-76451-panic branch from 142482e to d51616f Compare March 10, 2026 13:30

coderabbitai Bot reviewed Mar 10, 2026

openshift-ci Bot assigned bitoku Mar 11, 2026

openshift-ci Bot added the lgtm label ("Indicates that a PR is ready to be merged.") on Mar 11, 2026

bitoku changed the title ~~fix: prevent panic on closed stopTimeoutChan in StopContainer~~ OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer Mar 11, 2026

openshift-ci-robot added the jira/valid-bug label ("Indicates that a referenced Jira bug is valid for the branch this PR is targeting.") and removed the jira/invalid-bug label ("Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.") on Mar 11, 2026

saschagrunert approved these changes Mar 13, 2026

openshift-ci Bot assigned saschagrunert Mar 13, 2026

openshift-ci Bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") on Mar 13, 2026

openshift-merge-bot Bot merged commit 819d0dd into cri-o:main Mar 13, 2026
74 checks passed

openshift-cherrypick-robot mentioned this pull request Mar 13, 2026

OCPBUGS-76451: [release-1.35] : fix: prevent panic on closed stopTimeoutChan in StopContainer #9814

Merged

sabujmaity mentioned this pull request May 4, 2026

OCPBUGS-84922: fix: prevent panic on concurrent StopContainer calls #9920

Merged

sabujmaity mentioned this pull request May 6, 2026

[release-1.33]: OCPBUGS-84972: fix: prevent panic on concurrent StopContainer calls #9926

Open
