OCPBUGS-76451: fix: prevent panic on closed stopTimeoutChan in StopContainer#9799
Conversation
|
Hi @sabujmaity. Thanks for your PR. I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughAdded a Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
bitoku
left a comment
There was a problem hiding this comment.
lgtm, let's make it ready for review so that we can run the tests
First I thought we can put WaitOnStopTimeout in if c.SetAsStopping(), but it's not that simple because the subsequent StopContainer calls can't wait.
| It("should not panic when WaitOnStopTimeout is called after SetAsDoneStopping", func() { | ||
|
|
||
| ctx := context.Background() | ||
|
|
||
| sut.SetAsStopping() | ||
|
|
||
| // Simulation of the stop loop finishing. | ||
| sut.SetAsDoneStopping() | ||
|
|
||
| Expect(func() { | ||
| sut.WaitOnStopTimeout(ctx, 1000) | ||
| }).ToNot(Panic()) | ||
| }) | ||
|
|
There was a problem hiding this comment.
I want more explanation here, about why we want to check this, otherwise developers in the future may have trouble understanding why we check this.
There was a problem hiding this comment.
How about this
// Regression test for a race between concurrent StopContainer calls.
// When a second StopContainer arrives after the first has already
// completed (SetAsDoneStopping closed stopTimeoutChan),
// WaitOnStopTimeout used to panic on the closed channel.
// The stopDone guard ensures it returns early instead.
There was a problem hiding this comment.
Sure thing. This is a better explanation ! Thanks for the suggestions.
|
/ok-to-test |
|
/retest-required |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #9799 +/- ##
==========================================
+ Coverage 67.45% 67.64% +0.19%
==========================================
Files 210 212 +2
Lines 29123 29237 +114
==========================================
+ Hits 19644 19777 +133
+ Misses 7790 7774 -16
+ Partials 1689 1686 -3 🚀 New features to boost your workflow:
|
b764363 to
e7f536e
Compare
|
/retest |
e7f536e to
142482e
Compare
A race condition occurs when a second StopContainer call arrives after the container has already been marked as done stopping. Specifically, SetAsDoneStopping closes the stopTimeoutChan, and subsequent calls attempting to interact with or close this channel result in a "panic: close of closed channel". This patch adds a guard using the stopDone internal state within WaitOnStopTimeout to ensure we return early if the stop lifecycle has already completed, preventing redundant channel operations. Addresses: OCPBUGS-76451 Signed-off-by: Sabuj Maity <samaity@redhat.com>
142482e to
d51616f
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/oci/container.go (1)
72-74: Consider adding a comment explaining the purpose ofstopDone.Other stop-related fields like
stopTimeoutChanhave comments explaining their role. A brief comment here would clarify why this flag exists (to guard against the race condition whenWaitOnStopTimeoutis called afterSetAsDoneStoppinghas already closed the channel).💡 Suggested comment
stopping bool + // stopDone guards WaitOnStopTimeout from using stopTimeoutChan after + // SetAsDoneStopping has closed it, preventing a panic on concurrent + // StopContainer calls that race against the stop lifecycle completing. stopDone bool stopLock sync.MutexAs per coding guidelines: "Add comments explaining 'why' not 'what' in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/oci/container.go` around lines 72 - 74, Add a brief Go comment above the stopDone field explaining why it exists (not what it is): state that stopDone is a boolean used to indicate the stop completion to avoid a race where WaitOnStopTimeout may be invoked after SetAsDoneStopping has already closed stopTimeoutChan; mention it is checked/guarded under stopLock to prevent double-close/race conditions with stopTimeoutChan and other stop-related logic (references: stopDone, stopLock, stopTimeoutChan, WaitOnStopTimeout, SetAsDoneStopping).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/oci/container.go`:
- Around line 72-74: Add a brief Go comment above the stopDone field explaining
why it exists (not what it is): state that stopDone is a boolean used to
indicate the stop completion to avoid a race where WaitOnStopTimeout may be
invoked after SetAsDoneStopping has already closed stopTimeoutChan; mention it
is checked/guarded under stopLock to prevent double-close/race conditions with
stopTimeoutChan and other stop-related logic (references: stopDone, stopLock,
stopTimeoutChan, WaitOnStopTimeout, SetAsDoneStopping).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b94841fd-8b3b-405e-a64e-e97891214235
📒 Files selected for processing (2)
internal/oci/container.gointernal/oci/container_test.go
|
/lgtm if the tests pass |
|
/lgtm |
|
@cri-o/cri-o-maintainers PTAL |
|
@sabujmaity: This pull request references Jira Issue OCPBUGS-76451, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bitoku: This pull request references Jira Issue OCPBUGS-76451, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lyman9966. Note that only cri-o members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sabujmaity, saschagrunert The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@sabujmaity: Jira Issue OCPBUGS-76451: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-76451 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-1.35 |
|
@bitoku: new pull request created: #9814 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Add a stopDone boolean guard to the Container struct. When SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone. WaitOnStopTimeout checks this flag and returns early, preventing a panic from sending on a closed channel when a second StopContainer call arrives after the first has completed. This is a manual cherry-pick of #9799 to release-1.34, adapted for the branch's test file structure. Signed-off-by: Sabuj Maity <samaity@redhat.com>
Add a stopDone boolean guard to the Container struct. When SetAsDoneStopping closes the stopTimeoutChan, it also sets stopDone. WaitOnStopTimeout checks this flag and returns early, preventing a panic from sending on a closed channel when a second StopContainer call arrives after the first has completed. This is a manual cherry-pick of cri-o#9799 to release-1.33, adapted for the branch's test file structure. Signed-off-by: Sabuj Maity <samaity@redhat.com>
/kind bug
What this PR does / why we need it:
There is a race condition in the container stop path. When a second
StopContainercallcomes in after the first one has already finished (i.e.
SetAsDoneStoppinghasrun),
WaitOnStopTimeoutstill gets called. At that pointstopTimeoutChanis already closed, so we panic.
The sequence is:
StopContainer-> starts the stop loop, waits viaWaitOnStopTimeoutSetAsDoneStoppingclosesstopTimeoutChanand watchersStopContainer->SetAsStoppingreturns false (already stopping),falls through to
WaitOnStopTimeoutwhich tries to use the closed channelThis adds a
stopDonebool (guarded by the existingstopLock) that gets setin
SetAsDoneStopping.WaitOnStopTimeoutchecks it and returns early ifthe stop lifecycle is already done.
Which issue(s) this PR fixes:
Addresses: OCPBUGS-76451
Special notes for your reviewer:
The change is small : one new bool field, one assignment, one extra condition
check. All under the existing
stopLockso no new synchronization concerns.Added a unit test that exercises the exact crash sequence:
SetAsStopping -> SetAsDoneStopping -> WaitOnStopTimeoutand confirms itdoesn't panic.
Does this PR introduce a user-facing change?
Summary by CodeRabbit
Bug Fixes
Tests