cnimgr: continuously poll CNI STATUS to detect runtime health changes#9855
Conversation
|
Hi @tsorya. Thanks for your PR. I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaced the startup poll with a cancellable two-phase continuous poll (fast readiness, slow health monitor), added Changes
Sequence DiagramsequenceDiagram
participant Client
participant CNIManager
participant FastPhase as Phase 1 (Fast Readiness)
participant SlowPhase as Phase 2 (Health Monitor)
participant Plugin
participant GC
participant Watchers
Client->>CNIManager: Start polling
CNIManager->>FastPhase: pollContinuously(ctx) -> start fast phase
loop Fast Phase (short interval)
FastPhase->>Plugin: StatusWithContext(ctx)
alt healthy
Plugin-->>FastPhase: nil
FastPhase->>GC: doGC(ctx)
GC-->>FastPhase: done
FastPhase->>Watchers: send true (non-blocking)
FastPhase->>SlowPhase: transition to slow monitoring
else unhealthy
Plugin-->>FastPhase: error
FastPhase->>FastPhase: continue fast polling
end
end
loop Slow Phase (long interval)
SlowPhase->>Plugin: StatusWithContext(ctx)
alt recovery (error -> nil)
Plugin-->>SlowPhase: nil
SlowPhase->>GC: doGC(ctx)
SlowPhase->>Watchers: send true (non-blocking)
else failure (nil -> error)
Plugin-->>SlowPhase: error
SlowPhase->>Watchers: send false (non-blocking)
else unchanged
SlowPhase->>SlowPhase: continue monitoring
end
end
Client->>CNIManager: Shutdown()
CNIManager->>CNIManager: cancel polling context
CNIManager->>CNIManager: lastError = errShutdown
CNIManager->>Watchers: send false (non-blocking) and clear watchers
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
4cb7a9b to
09adc49
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/config/cnimgr/cnimgr_test_inject.go`:
- Around line 23-24: The injected test call passes a nil context into
c.statusPollFunc which can lead to a panic when statusPollFunc invokes doGC that
uses context.WithTimeout; update the test to pass a real context (e.g.,
context.Background() or context.TODO()) into the call to c.statusPollFunc
instead of nil so statusPollFunc and doGC receive a valid context.
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 91-99: When broadcasting readiness/shutdown, treat watcher
channels as terminal: in the loop that iterates c.watchers (the non-blocking
send shown) send the signal, close each watcher channel, and then clear
c.watchers so stale waiters are not revisited; additionally modify AddWatcher()
to check c.lastError (and errShutdown) under the same lock and if set return a
channel that is already signaled/closed instead of appending a new watcher to
c.watchers. This ensures Shutdown()/readiness broadcasts fully terminate
watchers and prevents callers from being appended to a canceled poller and
blocking forever.
- Around line 54-60: In pollContinuously, statusPollFunc is calling
c.plugin.Status() without the request context which can block while holding the
mutex; update statusPollFunc (used in pollContinuously) to call
c.plugin.StatusWithContext(ctx) and pass the provided ctx into that call (and do
the same for the second-phase polling lambda), so status checks honor
cancellation and do not stall ReadyOrError, AddWatcher, or Shutdown; locate the
statusPollFunc closures referenced in pollContinuously and replace Status() with
StatusWithContext(ctx).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c53c07ed-0ac5-43e2-8c89-a60320631769
📒 Files selected for processing (3)
internal/config/cnimgr/cnimgr.gointernal/config/cnimgr/cnimgr_test.gointernal/config/cnimgr/cnimgr_test_inject.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
internal/config/cnimgr/cnimgr.go (2)
94-99:⚠️ Potential issue | 🔴 CriticalTreat watcher channels as terminal after readiness/shutdown.
AddWatcher()currently appends whenlastError != nil, includingerrShutdown(Line 127), so callers can get parked forever after shutdown. Also, watcher lists are not cleared after notifications (Lines 94-99 and 150-155), so stale channels are revisited repeatedly.Suggested fix
func (c *CNIManager) AddWatcher() chan bool { c.mutex.Lock() defer c.mutex.Unlock() watcher := make(chan bool, 1) - if c.lastError == nil { + switch { + case c.lastError == nil: watcher <- true - } else { + close(watcher) + case errors.Is(c.lastError, errShutdown): + watcher <- false + close(watcher) + default: c.watchers = append(c.watchers, watcher) } return watcher }for _, watcher := range c.watchers { select { case watcher <- true: default: } + close(watcher) } + c.watchers = nil }for _, watcher := range c.watchers { select { case watcher <- false: default: } + close(watcher) } + c.watchers = nil }Also applies to: 122-133, 136-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 94 - 99, The watcher channels are treated as persistent but must be terminal: update AddWatcher() to NOT append new watchers when c.lastError != nil (including errShutdown) so callers don't get parked after shutdown, and change the notification logic in the loop that iterates c.watchers (and the similar block around lines 136-155) to treat channels as one‑time: send the readiness/shutdown signal non-blockingly to each watcher (select { case watcher <- true: default: }), then remove/clear that watcher from the slice (or set c.watchers = nil) so stale channels are not revisited; ensure this uses the existing symbols AddWatcher, c.watchers, c.lastError and errShutdown so behavior is consistent across readiness and shutdown paths.
72-76:⚠️ Potential issue | 🟠 MajorUse context-aware CNI status checks (and avoid status call under the lock).
At Line 76,
c.plugin.Status()ignores the providedctx. If status blocks, it can hold the write lock longer than necessary and stallReadyOrError(),AddWatcher(), andShutdown()paths.Suggested fix
func (c *CNIManager) statusPollFunc(ctx context.Context, stopOnReady bool) (bool, error) { - c.mutex.Lock() - defer c.mutex.Unlock() - - if err := c.plugin.Status(); err != nil { + if err := c.plugin.StatusWithContext(ctx); err != nil { + c.mutex.Lock() c.lastError = err + c.mutex.Unlock() return false, nil } + c.mutex.Lock() + defer c.mutex.Unlock() + if c.lastError != nil { c.lastError = nil#!/bin/bash set -euo pipefail # Verify ocicni interface methods and current cnimgr usage. rg -n -C2 'type CNIPlugin interface|StatusWithContext\(|Status\(' \ vendor/github.com/cri-o/ocicni/pkg/ocicni/types.go \ internal/config/cnimgr/cnimgr.goAs per coding guidelines, "Propagate context.Context through function calls in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 76, The statusPollFunc currently calls c.plugin.Status() while holding c.mutex and without passing ctx; change it to avoid calling the blocking status under the lock and to use a context-aware API: grab a local reference to the plugin (e.g., p := c.plugin) while holding c.mutex, release the lock, then call the context-aware status method (preferably p.StatusWithContext(ctx) if available, or add/use a StatusWithContext wrapper on the CNI plugin) and only re-acquire the lock to update manager state based on the result; update statusPollFunc, and ensure callers like ReadyOrError(), AddWatcher(), and Shutdown() are not blocked by the status call.
🧹 Nitpick comments (1)
internal/config/cnimgr/cnimgr.go (1)
88-90: Prefer structured logrus fields for GC error logging.At Line 89, use structured logging instead of
Warnfso error data is consistently queryable.Suggested refactor
- logrus.Warnf("Garbage collect stale network resources failed: %v", err) + logrus.WithError(err).Warn("Garbage collect stale network resources failed")As per coding guidelines, "Use logrus with structured fields for logging in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, Replace the unstructured Warnf call that logs GC errors with a structured logrus entry: when c.doGC(ctx) returns an error, call logrus.WithError(err).Warn("Garbage collect stale network resources failed") (or use logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC failure is logged using structured fields; update the block that currently uses logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 94-99: The watcher channels are treated as persistent but must be
terminal: update AddWatcher() to NOT append new watchers when c.lastError != nil
(including errShutdown) so callers don't get parked after shutdown, and change
the notification logic in the loop that iterates c.watchers (and the similar
block around lines 136-155) to treat channels as one‑time: send the
readiness/shutdown signal non-blockingly to each watcher (select { case watcher
<- true: default: }), then remove/clear that watcher from the slice (or set
c.watchers = nil) so stale channels are not revisited; ensure this uses the
existing symbols AddWatcher, c.watchers, c.lastError and errShutdown so behavior
is consistent across readiness and shutdown paths.
- Around line 72-76: The statusPollFunc currently calls c.plugin.Status() while
holding c.mutex and without passing ctx; change it to avoid calling the blocking
status under the lock and to use a context-aware API: grab a local reference to
the plugin (e.g., p := c.plugin) while holding c.mutex, release the lock, then
call the context-aware status method (preferably p.StatusWithContext(ctx) if
available, or add/use a StatusWithContext wrapper on the CNI plugin) and only
re-acquire the lock to update manager state based on the result; update
statusPollFunc, and ensure callers like ReadyOrError(), AddWatcher(), and
Shutdown() are not blocked by the status call.
---
Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 88-90: Replace the unstructured Warnf call that logs GC errors
with a structured logrus entry: when c.doGC(ctx) returns an error, call
logrus.WithError(err).Warn("Garbage collect stale network resources failed") (or
use logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC
failure is logged using structured fields; update the block that currently uses
logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message
text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bab20aa6-07ff-467c-98dd-d8b5fc57e3bd
📒 Files selected for processing (3)
internal/config/cnimgr/cnimgr.gointernal/config/cnimgr/cnimgr_test.gointernal/config/cnimgr/cnimgr_test_inject.go
✅ Files skipped from review due to trivial changes (1)
- internal/config/cnimgr/cnimgr_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/config/cnimgr/cnimgr_test_inject.go
09adc49 to
a73153a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/config/cnimgr/cnimgr.go (1)
72-100:⚠️ Potential issue | 🟠 MajorFollow-up: external CNI work still runs under
c.mutex.
statusPollFunckeeps the write lock held acrossStatusWithContext,validPodList(), andplugin.GC(). If any of those block,ReadyOrError()on the kubelet status path (server/runtime_status.go:1-20) and watcher registration inserver/sandbox_network.go:367-378stall behind the poller. Please narrow the critical section to state transitions and watcher handoff, and run the external work outside the mutex.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 100, statusPollFunc currently holds c.mutex while calling external/blocking work (plugin.StatusWithContext, validPodList/doGC), causing other paths to stall; change it so the mutex is only held to read/update internal state and to hand off watchers. Concretely: acquire c.mutex to snapshot any required state (e.g. c.lastError, copy of c.watchers) and then release the mutex before calling plugin.StatusWithContext, validPodList(), or doGC; perform those calls and compute the new state outside the lock, then re-acquire c.mutex to update c.lastError, clear/replace c.watchers and perform the non-blocking watcher sends. Ensure you use local variables for results and only mutate CNIManager fields (c.lastError, c.watchers) while holding c.mutex to keep transitions atomic.
🧹 Nitpick comments (2)
internal/config/cnimgr/cnimgr.go (2)
41-49: Pass a parent context into the poller.The manager now owns a long-lived goroutine, but it is rooted in
context.Background(), so its lifetime is detached from the caller andShutdown()becomes the only cancellation path. Please accept a parentcontext.Contexthere and derive the internal cancelable context from that instead.As per coding guidelines, "Propagate context.Context through function calls in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 41 - 49, The poller is created from context.Background() so its lifetime is detached; change the constructor that creates CNIManager to accept a parent context (e.g. ctx context.Context) and derive the cancellable context with context.WithCancel(parentCtx) instead of context.Background(), then pass that derived ctx into mgr.pollContinuously(ctx); keep the existing mgr.cancel and Shutdown() behavior so Shutdown still calls mgr.cancel() but ensure callers supply the parent context when creating the CNIManager (update the function signature that instantiates CNIManager accordingly).
88-90: Log GC failures as structured fields.Please keep the error as a field here rather than formatting it into the message.
As per coding guidelines, "Use logrus with structured fields for logging in Go code".📝 Suggested change
- logrus.Warnf("Garbage collect stale network resources failed: %v", err) + logrus.WithError(err).Warn("Garbage collect stale network resources failed")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, The log currently formats the error into the message using logrus.Warnf after calling c.doGC(ctx); change it to use structured logging so the error is attached as a field (e.g., use logrus.WithError(err).Warn(...) or logrus.WithField("err", err).Warn(...)) and keep the original descriptive text "Garbage collect stale network resources failed" as the message; update the error handling where c.doGC(ctx) is called and replace the logrus.Warnf invocation accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager currently starts the immediate Phase 1 poll
by launching go mgr.pollContinuously(ctx) which races tests that set
mgr.validPodList; change newTestManager to construct and return the CNIManager
without starting the goroutine, add a small Start or StartPolling method (e.g.,
CNIManager.Start(ctx) or CNIManager.StartPolling) that calls pollContinuously,
and update tests to set mgr.validPodList before calling Start; refer to
newTestManager, CNIManager.pollContinuously, Start/StartPolling (new) and
validPodList when making the change.
---
Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 72-100: statusPollFunc currently holds c.mutex while calling
external/blocking work (plugin.StatusWithContext, validPodList/doGC), causing
other paths to stall; change it so the mutex is only held to read/update
internal state and to hand off watchers. Concretely: acquire c.mutex to snapshot
any required state (e.g. c.lastError, copy of c.watchers) and then release the
mutex before calling plugin.StatusWithContext, validPodList(), or doGC; perform
those calls and compute the new state outside the lock, then re-acquire c.mutex
to update c.lastError, clear/replace c.watchers and perform the non-blocking
watcher sends. Ensure you use local variables for results and only mutate
CNIManager fields (c.lastError, c.watchers) while holding c.mutex to keep
transitions atomic.
---
Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 41-49: The poller is created from context.Background() so its
lifetime is detached; change the constructor that creates CNIManager to accept a
parent context (e.g. ctx context.Context) and derive the cancellable context
with context.WithCancel(parentCtx) instead of context.Background(), then pass
that derived ctx into mgr.pollContinuously(ctx); keep the existing mgr.cancel
and Shutdown() behavior so Shutdown still calls mgr.cancel() but ensure callers
supply the parent context when creating the CNIManager (update the function
signature that instantiates CNIManager accordingly).
- Around line 88-90: The log currently formats the error into the message using
logrus.Warnf after calling c.doGC(ctx); change it to use structured logging so
the error is attached as a field (e.g., use logrus.WithError(err).Warn(...) or
logrus.WithField("err", err).Warn(...)) and keep the original descriptive text
"Garbage collect stale network resources failed" as the message; update the
error handling where c.doGC(ctx) is called and replace the logrus.Warnf
invocation accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 84e0db6d-36dd-476e-a3a3-e4b7fe52880a
📒 Files selected for processing (3)
internal/config/cnimgr/cnimgr.gointernal/config/cnimgr/cnimgr_test.gointernal/config/cnimgr/cnimgr_test_inject.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/config/cnimgr/cnimgr_test_inject.go
433918e to
5eb101c
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/config/cnimgr/cnimgr_test.go (1)
62-72:⚠️ Potential issue | 🟡 Minor
newTestManagerstill races startup readiness in GC tests.Because polling starts immediately, subtests that assign
mgr.validPodListafter construction can miss the first ready transition and intermittently skip the GC call they assert on.💡 Deterministic helper pattern
-func newTestManager(plugin *fakeCNIPlugin) *CNIManager { +func newTestManager(plugin *fakeCNIPlugin, preStart func(*CNIManager)) *CNIManager { ctx, cancel := context.WithCancel(context.Background()) mgr := &CNIManager{ plugin: plugin, lastError: errors.New("plugin status uninitialized"), cancel: cancel, initPollInterval: testPollInterval, monitorPollInterval: testPollInterval, } + if preStart != nil { + preStart(mgr) + } go mgr.pollContinuously(ctx) return mgr }Then GC tests can install
validPodListinpreStartbefore polling begins.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr_test.go` around lines 62 - 72, newTestManager starts the pollContinuously goroutine immediately which races tests that want to set mgr.validPodList before polling begins; change newTestManager to not start polling itself (remove the go mgr.pollContinuously(ctx) call) and instead provide a deterministic start hook (e.g., add a Start/BeginPolling method or expose PollContinuouslyStart(ctx) that calls mgr.pollContinuously) so tests can set mgr.validPodList in their preStart setup and then invoke mgr.Start(ctx) to begin polling; update tests to call the new start method after installing validPodList and keep references to pollContinuously and validPodList to locate the affected code.
🧹 Nitpick comments (1)
internal/config/cnimgr/cnimgr.go (1)
89-90: Use structured logrus fields for error logging.Replace
logrus.Warnf()withlogrus.WithError(err).Warn()to align with structured logging conventions and enable better queryability in logs.Suggested change
- if err := c.doGC(ctx); err != nil { - logrus.Warnf("Garbage collect stale network resources failed: %v", err) - } + if err := c.doGC(ctx); err != nil { + logrus.WithError(err).Warn("Garbage collect stale network resources failed") + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/cnimgr/cnimgr.go` around lines 89 - 90, Replace the unstructured log call logrus.Warnf("Garbage collect stale network resources failed: %v", err) with a structured log using the error field: use logrus.WithError(err).Warn("Garbage collect stale network resources failed") so the same message is logged but the error is attached as structured data; update the occurrence where logrus.Warnf is called and ensure the err variable is passed into WithError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager starts the pollContinuously goroutine
immediately which races tests that want to set mgr.validPodList before polling
begins; change newTestManager to not start polling itself (remove the go
mgr.pollContinuously(ctx) call) and instead provide a deterministic start hook
(e.g., add a Start/BeginPolling method or expose PollContinuouslyStart(ctx) that
calls mgr.pollContinuously) so tests can set mgr.validPodList in their preStart
setup and then invoke mgr.Start(ctx) to begin polling; update tests to call the
new start method after installing validPodList and keep references to
pollContinuously and validPodList to locate the affected code.
---
Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 89-90: Replace the unstructured log call logrus.Warnf("Garbage
collect stale network resources failed: %v", err) with a structured log using
the error field: use logrus.WithError(err).Warn("Garbage collect stale network
resources failed") so the same message is logged but the error is attached as
structured data; update the occurrence where logrus.Warnf is called and ensure
the err variable is passed into WithError.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a9c96217-119c-4261-ac6b-5327fc226cc2
📒 Files selected for processing (3)
internal/config/cnimgr/cnimgr.gointernal/config/cnimgr/cnimgr_test.gointernal/config/cnimgr/cnimgr_test_inject.go
| c.lastError = nil | ||
| for _, watcher := range c.watchers { | ||
| watcher <- true | ||
| if c.lastError != nil { |
There was a problem hiding this comment.
hm I'm not sure we still want to do the GC if the plugin went down and came back up, but rather just once on startup. I think this should only happen on stopOnReady (or, maybe better, we should pull this section out and add it after we've done phase 1 but before we start phase 2, assuming they're done synchronously?)
There was a problem hiding this comment.
oh wait no we need to do gc before watchers, ignore the second option
There was a problem hiding this comment.
sure, i ll change a name for stopOnReady, will not match in this case
There was a problem hiding this comment.
Can you update the PR description as well for posterity?
6f837fd to
d0a8aa8
Compare
|
/approve LGTM, @cri-o/cri-o-maintainers PTAL |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: haircommander, tsorya The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/ok-to-test |
|
@tsorya can you update the PR description to include release note? (see PR template) |
d0a8aa8 to
451a735
Compare
60857a0 to
aac7feb
Compare
|
/retest-required |
1 similar comment
|
/retest-required |
|
/retest |
aac7feb to
46e4be2
Compare
|
/retest |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #9855 +/- ##
==========================================
- Coverage 67.57% 67.40% -0.17%
==========================================
Files 214 215 +1
Lines 29607 29657 +50
==========================================
- Hits 20007 19991 -16
- Misses 7903 7962 +59
- Partials 1697 1704 +7 🚀 New features to boost your workflow:
|
|
go: go.mod requires go >= 1.26.2 (running go 1.26.0; GOTOOLCHAIN=local) seems not relevant to the change |
46e4be2 to
bbec2e9
Compare
|
/retest |
1 similar comment
|
/retest |
bbec2e9 to
179ae13
Compare
|
/retest |
179ae13 to
954493d
Compare
|
@tsorya |
CRI-O polls CNI Status() at startup to wait for readiness, but stops after the first success. With the CNI STATUS verb (spec v1.1.0), plugins can now report unhealthy at runtime, but CRI-O doesn't react to this verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon restart, config regeneration, or network failure), CRI-O keeps reporting NetworkReady=true and pods fail at network setup instead of the node being marked not-ready. This is analogous to containerd, which calls Status() on every CRI Status request (see containerd/containerd#11135). Change the polling to two stages: - Init poll (500ms): fast poll until first ready, triggers GC and notifies watchers. Same as before. - Monitor poll (5s): continuous health check. Sets not-ready on failure, self-heals on recovery including GC for stale resources. Also replace the shutdown bool with context cancellation for clean goroutine lifecycle, and handle edge cases in AddWatcher for shutdown and already-ready states. Signed-off-by: Igal Tsoiref <itsoiref@redhat.com> Made-with: Cursor
954493d to
8ff525c
Compare
|
/retest |
1 similar comment
|
/retest |
|
/lgtm |
|
/retest |
1 similar comment
|
/retest |
|
/cherry-pick release-1.35 |
|
@haircommander: #9855 failed to apply on top of branch "release-1.35": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.
This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).
This PR changes the polling to two stages:
notifies watchers. Same as before.
failure, self-heals on recovery.
Also replaces the shutdown bool with context cancellation for clean
goroutine lifecycle, and handles edge cases in AddWatcher for shutdown
and already-ready states.
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/OCPBUGS-81474
Special notes for your reviewer:
The
doGCcall is still made under the mutex (inherited from theoriginal code). This could be improved in a follow-up by splitting the
lock around the I/O-heavy GC operation.
Does this PR introduce a user-facing change?