Skip to content

OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer#9799

Merged
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
sabujmaity:fix-ocbugs-76451-panic
Mar 13, 2026
Merged

OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer#9799
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
sabujmaity:fix-ocbugs-76451-panic

Conversation

@sabujmaity

@sabujmaity sabujmaity commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

/kind bug

What this PR does / why we need it:

There is a race condition in the container stop path. When a second StopContainer call
comes in after the first one has already finished (i.e. SetAsDoneStopping has
run), WaitOnStopTimeout still gets called. At that point stopTimeoutChan
is already closed, so we panic.

The sequence is:

  1. First StopContainer -> starts the stop loop, waits via WaitOnStopTimeout
  2. Stop loop finishes -> SetAsDoneStopping closes stopTimeoutChan and watchers
  3. Second StopContainer -> SetAsStopping returns false (already stopping),
    falls through to WaitOnStopTimeout which tries to use the closed channel

This adds a stopDone bool (guarded by the existing stopLock) that gets set
in SetAsDoneStopping. WaitOnStopTimeout checks it and returns early if
the stop lifecycle is already done.

Which issue(s) this PR fixes:

Addresses: OCPBUGS-76451

Special notes for your reviewer:

The change is small : one new bool field, one assignment, one extra condition
check. All under the existing stopLock so no new synchronization concerns.

Added a unit test that exercises the exact crash sequence:
SetAsStopping -> SetAsDoneStopping -> WaitOnStopTimeout and confirms it
doesn't panic.

Does this PR introduce a user-facing change?

Fixed a panic when concurrent StopContainer calls race against the stop lifecycle completing.

Summary by CodeRabbit

  • Bug Fixes

    • Prevented rare panics during container shutdown by improving coordination between stop and timeout handling, making shutdown sequences more reliable and resilient.
  • Tests

    • Added regression tests to ensure stop/wait operations do not panic when invoked after shutdown completion, reducing regressions and increasing stability.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 5, 2026
@openshift-ci

openshift-ci Bot commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

Hi @sabujmaity. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 5, 2026
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Added a stopDone boolean to Container and adjusted stop coordination so WaitOnStopTimeout returns early if not stopping or stopDone is true. SetAsDoneStopping now sets stopDone = true before notifying/closing watchers. Added a test to ensure no panic when waiting after done-stopping.

Changes

Cohort / File(s) Summary
Stop/Wait Coordination Logic
internal/oci/container.go
Added stopDone bool to Container. WaitOnStopTimeout now early-returns when not stopping or stopDone is true. SetAsDoneStopping sets stopDone = true before closing stop watchers and clearing timeout channel.
Synchronization Test Coverage
internal/oci/container_test.go
Added regression test ensuring WaitOnStopTimeout does not panic when called after SetAsDoneStopping; adjusted related test to validate watcher/channel clearing and subsequent panic behavior on second call.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • mrunalp
  • littlejawa
  • QiWang19

Poem

🐰 I set a flag and hopped aside,
The watchers closed; no panic cried.
Waiters call when work is done,
Threads settle under the sun.
🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title directly describes the main fix: preventing a panic on the closed stopTimeoutChan channel in StopContainer, which is the core issue addressed in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bitoku bitoku left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, let's make it ready for review so that we can run the tests

First I thought we can put WaitOnStopTimeout in if c.SetAsStopping(), but it's not that simple because the subsequent StopContainer calls can't wait.

Comment on lines +1011 to +1024
It("should not panic when WaitOnStopTimeout is called after SetAsDoneStopping", func() {

ctx := context.Background()

sut.SetAsStopping()

// Simulation of the stop loop finishing.
sut.SetAsDoneStopping()

Expect(func() {
sut.WaitOnStopTimeout(ctx, 1000)
}).ToNot(Panic())
})

@bitoku bitoku Mar 5, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want more explanation here, about why we want to check this, otherwise developers in the future may have trouble understanding why we check this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bitoku I have put up the explanation !

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this

// Regression test for a race between concurrent StopContainer calls.
// When a second StopContainer arrives after the first has already
// completed (SetAsDoneStopping closed stopTimeoutChan),
// WaitOnStopTimeout used to panic on the closed channel.
// The stopDone guard ensures it returns early instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. This is a better explanation ! Thanks for the suggestions.

@openshift-ci openshift-ci Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 5, 2026
@bitoku

bitoku commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 5, 2026
@sabujmaity sabujmaity marked this pull request as ready for review March 6, 2026 06:52
@sabujmaity sabujmaity requested a review from mrunalp as a code owner March 6, 2026 06:52
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2026
@openshift-ci openshift-ci Bot requested review from QiWang19 and littlejawa March 6, 2026 06:52
@sabujmaity

Copy link
Copy Markdown
Contributor Author

/retest-required

@codecov

codecov Bot commented Mar 6, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.64%. Comparing base (7421a8e) to head (d51616f).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9799      +/-   ##
==========================================
+ Coverage   67.45%   67.64%   +0.19%     
==========================================
  Files         210      212       +2     
  Lines       29123    29237     +114     
==========================================
+ Hits        19644    19777     +133     
+ Misses       7790     7774      -16     
+ Partials     1689     1686       -3     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sabujmaity sabujmaity force-pushed the fix-ocbugs-76451-panic branch from b764363 to e7f536e Compare March 6, 2026 19:48
@sabujmaity

Copy link
Copy Markdown
Contributor Author

/retest

@sabujmaity sabujmaity force-pushed the fix-ocbugs-76451-panic branch from e7f536e to 142482e Compare March 10, 2026 05:43
A race condition occurs when a second StopContainer call arrives
after the container has already been marked as done stopping.
Specifically, SetAsDoneStopping closes the stopTimeoutChan, and
subsequent calls attempting to interact with or close this
channel result in a "panic: close of closed channel".

This patch adds a guard using the stopDone internal state within
WaitOnStopTimeout to ensure we return early if the stop lifecycle
has already completed, preventing redundant channel operations.

Addresses: OCPBUGS-76451
Signed-off-by: Sabuj Maity <samaity@redhat.com>
@sabujmaity sabujmaity force-pushed the fix-ocbugs-76451-panic branch from 142482e to d51616f Compare March 10, 2026 13:30

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/oci/container.go (1)

72-74: Consider adding a comment explaining the purpose of stopDone.

Other stop-related fields like stopTimeoutChan have comments explaining their role. A brief comment here would clarify why this flag exists (to guard against the race condition when WaitOnStopTimeout is called after SetAsDoneStopping has already closed the channel).

💡 Suggested comment
 	stopping           bool
+	// stopDone guards WaitOnStopTimeout from using stopTimeoutChan after
+	// SetAsDoneStopping has closed it, preventing a panic on concurrent
+	// StopContainer calls that race against the stop lifecycle completing.
 	stopDone           bool
 	stopLock           sync.Mutex

As per coding guidelines: "Add comments explaining 'why' not 'what' in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/oci/container.go` around lines 72 - 74, Add a brief Go comment above
the stopDone field explaining why it exists (not what it is): state that
stopDone is a boolean used to indicate the stop completion to avoid a race where
WaitOnStopTimeout may be invoked after SetAsDoneStopping has already closed
stopTimeoutChan; mention it is checked/guarded under stopLock to prevent
double-close/race conditions with stopTimeoutChan and other stop-related logic
(references: stopDone, stopLock, stopTimeoutChan, WaitOnStopTimeout,
SetAsDoneStopping).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/oci/container.go`:
- Around line 72-74: Add a brief Go comment above the stopDone field explaining
why it exists (not what it is): state that stopDone is a boolean used to
indicate the stop completion to avoid a race where WaitOnStopTimeout may be
invoked after SetAsDoneStopping has already closed stopTimeoutChan; mention it
is checked/guarded under stopLock to prevent double-close/race conditions with
stopTimeoutChan and other stop-related logic (references: stopDone, stopLock,
stopTimeoutChan, WaitOnStopTimeout, SetAsDoneStopping).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b94841fd-8b3b-405e-a64e-e97891214235

📥 Commits

Reviewing files that changed from the base of the PR and between 142482e and d51616f.

📒 Files selected for processing (2)
  • internal/oci/container.go
  • internal/oci/container_test.go

@bitoku

bitoku commented Mar 10, 2026

Copy link
Copy Markdown
Contributor

/lgtm if the tests pass

@bitoku

bitoku commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2026
@bitoku

bitoku commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

@cri-o/cri-o-maintainers PTAL

@bitoku bitoku changed the title fix: prevent panic on closed stopTimeoutChan in StopContainer OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer Mar 11, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@sabujmaity: This pull request references Jira Issue OCPBUGS-76451, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/kind bug

What this PR does / why we need it:

There is a race condition in the container stop path. When a second StopContainer call
comes in after the first one has already finished (i.e. SetAsDoneStopping has
run), WaitOnStopTimeout still gets called. At that point stopTimeoutChan
is already closed, so we panic.

The sequence is:

  1. First StopContainer -> starts the stop loop, waits via WaitOnStopTimeout
  2. Stop loop finishes -> SetAsDoneStopping closes stopTimeoutChan and watchers
  3. Second StopContainer -> SetAsStopping returns false (already stopping),
    falls through to WaitOnStopTimeout which tries to use the closed channel

This adds a stopDone bool (guarded by the existing stopLock) that gets set
in SetAsDoneStopping. WaitOnStopTimeout checks it and returns early if
the stop lifecycle is already done.

Which issue(s) this PR fixes:

Addresses: OCPBUGS-76451

Special notes for your reviewer:

The change is small : one new bool field, one assignment, one extra condition
check. All under the existing stopLock so no new synchronization concerns.

Added a unit test that exercises the exact crash sequence:
SetAsStopping -> SetAsDoneStopping -> WaitOnStopTimeout and confirms it
doesn't panic.

Does this PR introduce a user-facing change?

Fixed a panic when concurrent StopContainer calls race against the stop lifecycle completing.

Summary by CodeRabbit

  • Bug Fixes

  • Prevented rare panics during container shutdown by improving coordination between stop and timeout handling, making shutdown sequences more reliable and resilient.

  • Tests

  • Added regression tests to ensure stop/wait operations do not panic when invoked after shutdown completion, reducing regressions and increasing stability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku

bitoku commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@bitoku: This pull request references Jira Issue OCPBUGS-76451, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lyman9966

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lyman9966.

Note that only cri-o members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@bitoku: This pull request references Jira Issue OCPBUGS-76451, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lyman9966

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sabujmaity, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 819d0dd into cri-o:main Mar 13, 2026
74 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@sabujmaity: Jira Issue OCPBUGS-76451: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-76451 has been moved to the MODIFIED state.

Details

In response to this:

/kind bug

What this PR does / why we need it:

There is a race condition in the container stop path. When a second StopContainer call
comes in after the first one has already finished (i.e. SetAsDoneStopping has
run), WaitOnStopTimeout still gets called. At that point stopTimeoutChan
is already closed, so we panic.

The sequence is:

  1. First StopContainer -> starts the stop loop, waits via WaitOnStopTimeout
  2. Stop loop finishes -> SetAsDoneStopping closes stopTimeoutChan and watchers
  3. Second StopContainer -> SetAsStopping returns false (already stopping),
    falls through to WaitOnStopTimeout which tries to use the closed channel

This adds a stopDone bool (guarded by the existing stopLock) that gets set
in SetAsDoneStopping. WaitOnStopTimeout checks it and returns early if
the stop lifecycle is already done.

Which issue(s) this PR fixes:

Addresses: OCPBUGS-76451

Special notes for your reviewer:

The change is small : one new bool field, one assignment, one extra condition
check. All under the existing stopLock so no new synchronization concerns.

Added a unit test that exercises the exact crash sequence:
SetAsStopping -> SetAsDoneStopping -> WaitOnStopTimeout and confirms it
doesn't panic.

Does this PR introduce a user-facing change?

Fixed a panic when concurrent StopContainer calls race against the stop lifecycle completing.

Summary by CodeRabbit

  • Bug Fixes

  • Prevented rare panics during container shutdown by improving coordination between stop and timeout handling, making shutdown sequences more reliable and resilient.

  • Tests

  • Added regression tests to ensure stop/wait operations do not panic when invoked after shutdown completion, reducing regressions and increasing stability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku

bitoku commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

/cherry-pick release-1.35

@openshift-cherrypick-robot

Copy link
Copy Markdown

@bitoku: new pull request created: #9814

Details

In response to this:

/cherry-pick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-bot Bot pushed a commit that referenced this pull request May 5, 2026
Add a stopDone boolean guard to the Container struct. When
SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone.
WaitOnStopTimeout checks this flag and returns early, preventing a
panic from sending on a closed channel when a second StopContainer
call arrives after the first has completed.

This is a manual cherry-pick of #9799 to release-1.34, adapted for
the branch's test file structure.

Signed-off-by: Sabuj Maity <samaity@redhat.com>
sabujmaity added a commit to sabujmaity/cri-o that referenced this pull request May 6, 2026
Add a stopDone boolean guard to the Container struct. When
SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone.
WaitOnStopTimeout checks this flag and returns early, preventing a
panic from sending on a closed channel when a second StopContainer
call arrives after the first has completed.
This is a manual cherry-pick of cri-o#9799 to release-1.33, adapted for
the branch's test file structure.

Signed-off-by: Sabuj Maity <samaity@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants