Skip to content

cnimgr: continuously poll CNI STATUS to detect runtime health changes#9855

Merged
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
tsorya:continuous-cni-status-poll
Apr 20, 2026
Merged

cnimgr: continuously poll CNI STATUS to detect runtime health changes#9855
openshift-merge-bot[bot] merged 1 commit into
cri-o:mainfrom
tsorya:continuous-cni-status-poll

Conversation

@tsorya

@tsorya tsorya commented Mar 28, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

This PR changes the polling to two stages:

  • Init poll (500ms): fast poll until first ready, triggers GC and
    notifies watchers. Same as before.
  • Monitor poll (5s): continuous health check. Sets not-ready on
    failure, self-heals on recovery.

Also replaces the shutdown bool with context cancellation for clean
goroutine lifecycle, and handles edge cases in AddWatcher for shutdown
and already-ready states.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-81474

Special notes for your reviewer:

The doGC call is still made under the mutex (inherited from the
original code). This could be improved in a follow-up by splitting the
lock around the I/O-heavy GC operation.

Does this PR introduce a user-facing change?

CRI-O now continuously monitors CNI plugin health using the STATUS
verb. If a plugin becomes unhealthy after initial readiness, the node
is reported as NetworkReady=false, preventing pod scheduling on
affected nodes. The node self-heals when the plugin recovers.

@tsorya tsorya requested a review from mrunalp as a code owner March 28, 2026 03:28
@openshift-ci openshift-ci Bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 28, 2026
@openshift-ci

openshift-ci Bot commented Mar 28, 2026

Copy link
Copy Markdown
Contributor

Hi @tsorya. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from bitoku and littlejawa March 28, 2026 03:28
@coderabbitai

coderabbitai Bot commented Mar 28, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaced the startup poll with a cancellable two-phase continuous poll (fast readiness, slow health monitor), added errShutdown, refactored status checks into statusPollFunc, changed watcher semantics and Shutdown behavior, and added tests for readiness, flapping, GC, and watchers.

Changes

Cohort / File(s) Summary
Core Polling Refactor
internal/config/cnimgr/cnimgr.go
Added errShutdown, replaced boolean shutdown with cancellable polling context, introduced initPollInterval/monitorPollInterval, replaced pollFunc/pollUntilReady with pollContinuously(ctx) using statusPollFunc(ctx, stopOnReady), adjusted watcher handling (non-blocking notifies, clearing), updated Shutdown() to cancel context and set lastError = errShutdown, and updated GC defer comments.
New Tests
internal/config/cnimgr/cnimgr_test.go
New test suite with fakeCNIPlugin, newTestManager, and TestStatusPolling covering readiness transitions, flapping, watcher notifications (immediate and on transitions, non-blocking), GC invocation behavior, and shutdown semantics.
Test Injection Adjust
internal/config/cnimgr/cnimgr_test_inject.go
Test-only change: SetCNIPlugin now invokes c.statusPollFunc(context.Background(), false) (with //nolint:errcheck) instead of calling c.pollFunc(); added context import. No public API changes.

Sequence Diagram

sequenceDiagram
    participant Client
    participant CNIManager
    participant FastPhase as Phase 1 (Fast Readiness)
    participant SlowPhase as Phase 2 (Health Monitor)
    participant Plugin
    participant GC
    participant Watchers

    Client->>CNIManager: Start polling
    CNIManager->>FastPhase: pollContinuously(ctx) -> start fast phase

    loop Fast Phase (short interval)
        FastPhase->>Plugin: StatusWithContext(ctx)
        alt healthy
            Plugin-->>FastPhase: nil
            FastPhase->>GC: doGC(ctx)
            GC-->>FastPhase: done
            FastPhase->>Watchers: send true (non-blocking)
            FastPhase->>SlowPhase: transition to slow monitoring
        else unhealthy
            Plugin-->>FastPhase: error
            FastPhase->>FastPhase: continue fast polling
        end
    end

    loop Slow Phase (long interval)
        SlowPhase->>Plugin: StatusWithContext(ctx)
        alt recovery (error -> nil)
            Plugin-->>SlowPhase: nil
            SlowPhase->>GC: doGC(ctx)
            SlowPhase->>Watchers: send true (non-blocking)
        else failure (nil -> error)
            Plugin-->>SlowPhase: error
            SlowPhase->>Watchers: send false (non-blocking)
        else unchanged
            SlowPhase->>SlowPhase: continue monitoring
        end
    end

    Client->>CNIManager: Shutdown()
    CNIManager->>CNIManager: cancel polling context
    CNIManager->>CNIManager: lastError = errShutdown
    CNIManager->>Watchers: send false (non-blocking) and clear watchers
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through short and slow patrols,

I peeked at plugins, counted holes,
When health returned I swept the floor,
I nudged the watchers, then no more,
Now quiet paws rest after roles.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: introducing continuous CNI STATUS polling to detect runtime health changes, which is the core objective of this PR.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 4cb7a9b to 09adc49 Compare March 28, 2026 03:33

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/config/cnimgr/cnimgr_test_inject.go`:
- Around line 23-24: The injected test call passes a nil context into
c.statusPollFunc which can lead to a panic when statusPollFunc invokes doGC that
uses context.WithTimeout; update the test to pass a real context (e.g.,
context.Background() or context.TODO()) into the call to c.statusPollFunc
instead of nil so statusPollFunc and doGC receive a valid context.

In `@internal/config/cnimgr/cnimgr.go`:
- Around line 91-99: When broadcasting readiness/shutdown, treat watcher
channels as terminal: in the loop that iterates c.watchers (the non-blocking
send shown) send the signal, close each watcher channel, and then clear
c.watchers so stale waiters are not revisited; additionally modify AddWatcher()
to check c.lastError (and errShutdown) under the same lock and if set return a
channel that is already signaled/closed instead of appending a new watcher to
c.watchers. This ensures Shutdown()/readiness broadcasts fully terminate
watchers and prevents callers from being appended to a canceled poller and
blocking forever.
- Around line 54-60: In pollContinuously, statusPollFunc is calling
c.plugin.Status() without the request context which can block while holding the
mutex; update statusPollFunc (used in pollContinuously) to call
c.plugin.StatusWithContext(ctx) and pass the provided ctx into that call (and do
the same for the second-phase polling lambda), so status checks honor
cancellation and do not stall ReadyOrError, AddWatcher, or Shutdown; locate the
statusPollFunc closures referenced in pollContinuously and replace Status() with
StatusWithContext(ctx).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c53c07ed-0ac5-43e2-8c89-a60320631769

📥 Commits

Reviewing files that changed from the base of the PR and between 5753c52 and 4cb7a9b.

📒 Files selected for processing (3)
  • internal/config/cnimgr/cnimgr.go
  • internal/config/cnimgr/cnimgr_test.go
  • internal/config/cnimgr/cnimgr_test_inject.go

Comment thread internal/config/cnimgr/cnimgr_test_inject.go Outdated
Comment thread internal/config/cnimgr/cnimgr.go
Comment thread internal/config/cnimgr/cnimgr.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
internal/config/cnimgr/cnimgr.go (2)

94-99: ⚠️ Potential issue | 🔴 Critical

Treat watcher channels as terminal after readiness/shutdown.

AddWatcher() currently appends when lastError != nil, including errShutdown (Line 127), so callers can get parked forever after shutdown. Also, watcher lists are not cleared after notifications (Lines 94-99 and 150-155), so stale channels are revisited repeatedly.

Suggested fix
 func (c *CNIManager) AddWatcher() chan bool {
 	c.mutex.Lock()
 	defer c.mutex.Unlock()

 	watcher := make(chan bool, 1)
-	if c.lastError == nil {
+	switch {
+	case c.lastError == nil:
 		watcher <- true
-	} else {
+		close(watcher)
+	case errors.Is(c.lastError, errShutdown):
+		watcher <- false
+		close(watcher)
+	default:
 		c.watchers = append(c.watchers, watcher)
 	}

 	return watcher
 }
 		for _, watcher := range c.watchers {
 			select {
 			case watcher <- true:
 			default:
 			}
+			close(watcher)
 		}
+		c.watchers = nil
 	}
 	for _, watcher := range c.watchers {
 		select {
 		case watcher <- false:
 		default:
 		}
+		close(watcher)
 	}
+	c.watchers = nil
 }

Also applies to: 122-133, 136-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 94 - 99, The watcher channels
are treated as persistent but must be terminal: update AddWatcher() to NOT
append new watchers when c.lastError != nil (including errShutdown) so callers
don't get parked after shutdown, and change the notification logic in the loop
that iterates c.watchers (and the similar block around lines 136-155) to treat
channels as one‑time: send the readiness/shutdown signal non-blockingly to each
watcher (select { case watcher <- true: default: }), then remove/clear that
watcher from the slice (or set c.watchers = nil) so stale channels are not
revisited; ensure this uses the existing symbols AddWatcher, c.watchers,
c.lastError and errShutdown so behavior is consistent across readiness and
shutdown paths.

72-76: ⚠️ Potential issue | 🟠 Major

Use context-aware CNI status checks (and avoid status call under the lock).

At Line 76, c.plugin.Status() ignores the provided ctx. If status blocks, it can hold the write lock longer than necessary and stall ReadyOrError(), AddWatcher(), and Shutdown() paths.

Suggested fix
 func (c *CNIManager) statusPollFunc(ctx context.Context, stopOnReady bool) (bool, error) {
-	c.mutex.Lock()
-	defer c.mutex.Unlock()
-
-	if err := c.plugin.Status(); err != nil {
+	if err := c.plugin.StatusWithContext(ctx); err != nil {
+		c.mutex.Lock()
 		c.lastError = err
+		c.mutex.Unlock()
 		return false, nil
 	}
 
+	c.mutex.Lock()
+	defer c.mutex.Unlock()
+
 	if c.lastError != nil {
 		c.lastError = nil
#!/bin/bash
set -euo pipefail
# Verify ocicni interface methods and current cnimgr usage.
rg -n -C2 'type CNIPlugin interface|StatusWithContext\(|Status\(' \
  vendor/github.com/cri-o/ocicni/pkg/ocicni/types.go \
  internal/config/cnimgr/cnimgr.go

As per coding guidelines, "Propagate context.Context through function calls in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 76, The statusPollFunc
currently calls c.plugin.Status() while holding c.mutex and without passing ctx;
change it to avoid calling the blocking status under the lock and to use a
context-aware API: grab a local reference to the plugin (e.g., p := c.plugin)
while holding c.mutex, release the lock, then call the context-aware status
method (preferably p.StatusWithContext(ctx) if available, or add/use a
StatusWithContext wrapper on the CNI plugin) and only re-acquire the lock to
update manager state based on the result; update statusPollFunc, and ensure
callers like ReadyOrError(), AddWatcher(), and Shutdown() are not blocked by the
status call.
🧹 Nitpick comments (1)
internal/config/cnimgr/cnimgr.go (1)

88-90: Prefer structured logrus fields for GC error logging.

At Line 89, use structured logging instead of Warnf so error data is consistently queryable.

Suggested refactor
-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")

As per coding guidelines, "Use logrus with structured fields for logging in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, Replace the
unstructured Warnf call that logs GC errors with a structured logrus entry: when
c.doGC(ctx) returns an error, call logrus.WithError(err).Warn("Garbage collect
stale network resources failed") (or use
logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC failure
is logged using structured fields; update the block that currently uses
logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message
text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 94-99: The watcher channels are treated as persistent but must be
terminal: update AddWatcher() to NOT append new watchers when c.lastError != nil
(including errShutdown) so callers don't get parked after shutdown, and change
the notification logic in the loop that iterates c.watchers (and the similar
block around lines 136-155) to treat channels as one‑time: send the
readiness/shutdown signal non-blockingly to each watcher (select { case watcher
<- true: default: }), then remove/clear that watcher from the slice (or set
c.watchers = nil) so stale channels are not revisited; ensure this uses the
existing symbols AddWatcher, c.watchers, c.lastError and errShutdown so behavior
is consistent across readiness and shutdown paths.
- Around line 72-76: The statusPollFunc currently calls c.plugin.Status() while
holding c.mutex and without passing ctx; change it to avoid calling the blocking
status under the lock and to use a context-aware API: grab a local reference to
the plugin (e.g., p := c.plugin) while holding c.mutex, release the lock, then
call the context-aware status method (preferably p.StatusWithContext(ctx) if
available, or add/use a StatusWithContext wrapper on the CNI plugin) and only
re-acquire the lock to update manager state based on the result; update
statusPollFunc, and ensure callers like ReadyOrError(), AddWatcher(), and
Shutdown() are not blocked by the status call.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 88-90: Replace the unstructured Warnf call that logs GC errors
with a structured logrus entry: when c.doGC(ctx) returns an error, call
logrus.WithError(err).Warn("Garbage collect stale network resources failed") (or
use logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC
failure is logged using structured fields; update the block that currently uses
logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message
text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bab20aa6-07ff-467c-98dd-d8b5fc57e3bd

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb7a9b and 09adc49.

📒 Files selected for processing (3)
  • internal/config/cnimgr/cnimgr.go
  • internal/config/cnimgr/cnimgr_test.go
  • internal/config/cnimgr/cnimgr_test_inject.go
✅ Files skipped from review due to trivial changes (1)
  • internal/config/cnimgr/cnimgr_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/config/cnimgr/cnimgr_test_inject.go

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 09adc49 to a73153a Compare March 29, 2026 03:43

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/config/cnimgr/cnimgr.go (1)

72-100: ⚠️ Potential issue | 🟠 Major

Follow-up: external CNI work still runs under c.mutex.

statusPollFunc keeps the write lock held across StatusWithContext, validPodList(), and plugin.GC(). If any of those block, ReadyOrError() on the kubelet status path (server/runtime_status.go:1-20) and watcher registration in server/sandbox_network.go:367-378 stall behind the poller. Please narrow the critical section to state transitions and watcher handoff, and run the external work outside the mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 100, statusPollFunc
currently holds c.mutex while calling external/blocking work
(plugin.StatusWithContext, validPodList/doGC), causing other paths to stall;
change it so the mutex is only held to read/update internal state and to hand
off watchers. Concretely: acquire c.mutex to snapshot any required state (e.g.
c.lastError, copy of c.watchers) and then release the mutex before calling
plugin.StatusWithContext, validPodList(), or doGC; perform those calls and
compute the new state outside the lock, then re-acquire c.mutex to update
c.lastError, clear/replace c.watchers and perform the non-blocking watcher
sends. Ensure you use local variables for results and only mutate CNIManager
fields (c.lastError, c.watchers) while holding c.mutex to keep transitions
atomic.
🧹 Nitpick comments (2)
internal/config/cnimgr/cnimgr.go (2)

41-49: Pass a parent context into the poller.

The manager now owns a long-lived goroutine, but it is rooted in context.Background(), so its lifetime is detached from the caller and Shutdown() becomes the only cancellation path. Please accept a parent context.Context here and derive the internal cancelable context from that instead.

As per coding guidelines, "Propagate context.Context through function calls in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 41 - 49, The poller is created
from context.Background() so its lifetime is detached; change the constructor
that creates CNIManager to accept a parent context (e.g. ctx context.Context)
and derive the cancellable context with context.WithCancel(parentCtx) instead of
context.Background(), then pass that derived ctx into mgr.pollContinuously(ctx);
keep the existing mgr.cancel and Shutdown() behavior so Shutdown still calls
mgr.cancel() but ensure callers supply the parent context when creating the
CNIManager (update the function signature that instantiates CNIManager
accordingly).

88-90: Log GC failures as structured fields.

Please keep the error as a field here rather than formatting it into the message.

📝 Suggested change
-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")
As per coding guidelines, "Use logrus with structured fields for logging in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, The log currently
formats the error into the message using logrus.Warnf after calling c.doGC(ctx);
change it to use structured logging so the error is attached as a field (e.g.,
use logrus.WithError(err).Warn(...) or logrus.WithField("err", err).Warn(...))
and keep the original descriptive text "Garbage collect stale network resources
failed" as the message; update the error handling where c.doGC(ctx) is called
and replace the logrus.Warnf invocation accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager currently starts the immediate Phase 1 poll
by launching go mgr.pollContinuously(ctx) which races tests that set
mgr.validPodList; change newTestManager to construct and return the CNIManager
without starting the goroutine, add a small Start or StartPolling method (e.g.,
CNIManager.Start(ctx) or CNIManager.StartPolling) that calls pollContinuously,
and update tests to set mgr.validPodList before calling Start; refer to
newTestManager, CNIManager.pollContinuously, Start/StartPolling (new) and
validPodList when making the change.

---

Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 72-100: statusPollFunc currently holds c.mutex while calling
external/blocking work (plugin.StatusWithContext, validPodList/doGC), causing
other paths to stall; change it so the mutex is only held to read/update
internal state and to hand off watchers. Concretely: acquire c.mutex to snapshot
any required state (e.g. c.lastError, copy of c.watchers) and then release the
mutex before calling plugin.StatusWithContext, validPodList(), or doGC; perform
those calls and compute the new state outside the lock, then re-acquire c.mutex
to update c.lastError, clear/replace c.watchers and perform the non-blocking
watcher sends. Ensure you use local variables for results and only mutate
CNIManager fields (c.lastError, c.watchers) while holding c.mutex to keep
transitions atomic.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 41-49: The poller is created from context.Background() so its
lifetime is detached; change the constructor that creates CNIManager to accept a
parent context (e.g. ctx context.Context) and derive the cancellable context
with context.WithCancel(parentCtx) instead of context.Background(), then pass
that derived ctx into mgr.pollContinuously(ctx); keep the existing mgr.cancel
and Shutdown() behavior so Shutdown still calls mgr.cancel() but ensure callers
supply the parent context when creating the CNIManager (update the function
signature that instantiates CNIManager accordingly).
- Around line 88-90: The log currently formats the error into the message using
logrus.Warnf after calling c.doGC(ctx); change it to use structured logging so
the error is attached as a field (e.g., use logrus.WithError(err).Warn(...) or
logrus.WithField("err", err).Warn(...)) and keep the original descriptive text
"Garbage collect stale network resources failed" as the message; update the
error handling where c.doGC(ctx) is called and replace the logrus.Warnf
invocation accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 84e0db6d-36dd-476e-a3a3-e4b7fe52880a

📥 Commits

Reviewing files that changed from the base of the PR and between 09adc49 and a73153a.

📒 Files selected for processing (3)
  • internal/config/cnimgr/cnimgr.go
  • internal/config/cnimgr/cnimgr_test.go
  • internal/config/cnimgr/cnimgr_test_inject.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/config/cnimgr/cnimgr_test_inject.go

Comment thread internal/config/cnimgr/cnimgr_test.go
@tsorya tsorya force-pushed the continuous-cni-status-poll branch 5 times, most recently from 433918e to 5eb101c Compare April 7, 2026 21:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/config/cnimgr/cnimgr_test.go (1)

62-72: ⚠️ Potential issue | 🟡 Minor

newTestManager still races startup readiness in GC tests.

Because polling starts immediately, subtests that assign mgr.validPodList after construction can miss the first ready transition and intermittently skip the GC call they assert on.

💡 Deterministic helper pattern
-func newTestManager(plugin *fakeCNIPlugin) *CNIManager {
+func newTestManager(plugin *fakeCNIPlugin, preStart func(*CNIManager)) *CNIManager {
 	ctx, cancel := context.WithCancel(context.Background())
 	mgr := &CNIManager{
 		plugin:         plugin,
 		lastError:      errors.New("plugin status uninitialized"),
 		cancel:         cancel,
 		initPollInterval:    testPollInterval,
 		monitorPollInterval: testPollInterval,
 	}
+	if preStart != nil {
+		preStart(mgr)
+	}
 	go mgr.pollContinuously(ctx)
 	return mgr
}

Then GC tests can install validPodList in preStart before polling begins.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr_test.go` around lines 62 - 72, newTestManager
starts the pollContinuously goroutine immediately which races tests that want to
set mgr.validPodList before polling begins; change newTestManager to not start
polling itself (remove the go mgr.pollContinuously(ctx) call) and instead
provide a deterministic start hook (e.g., add a Start/BeginPolling method or
expose PollContinuouslyStart(ctx) that calls mgr.pollContinuously) so tests can
set mgr.validPodList in their preStart setup and then invoke mgr.Start(ctx) to
begin polling; update tests to call the new start method after installing
validPodList and keep references to pollContinuously and validPodList to locate
the affected code.
🧹 Nitpick comments (1)
internal/config/cnimgr/cnimgr.go (1)

89-90: Use structured logrus fields for error logging.

Replace logrus.Warnf() with logrus.WithError(err).Warn() to align with structured logging conventions and enable better queryability in logs.

Suggested change
-		if err := c.doGC(ctx); err != nil {
-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
-		}
+		if err := c.doGC(ctx); err != nil {
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 89 - 90, Replace the
unstructured log call logrus.Warnf("Garbage collect stale network resources
failed: %v", err) with a structured log using the error field: use
logrus.WithError(err).Warn("Garbage collect stale network resources failed") so
the same message is logged but the error is attached as structured data; update
the occurrence where logrus.Warnf is called and ensure the err variable is
passed into WithError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager starts the pollContinuously goroutine
immediately which races tests that want to set mgr.validPodList before polling
begins; change newTestManager to not start polling itself (remove the go
mgr.pollContinuously(ctx) call) and instead provide a deterministic start hook
(e.g., add a Start/BeginPolling method or expose PollContinuouslyStart(ctx) that
calls mgr.pollContinuously) so tests can set mgr.validPodList in their preStart
setup and then invoke mgr.Start(ctx) to begin polling; update tests to call the
new start method after installing validPodList and keep references to
pollContinuously and validPodList to locate the affected code.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 89-90: Replace the unstructured log call logrus.Warnf("Garbage
collect stale network resources failed: %v", err) with a structured log using
the error field: use logrus.WithError(err).Warn("Garbage collect stale network
resources failed") so the same message is logged but the error is attached as
structured data; update the occurrence where logrus.Warnf is called and ensure
the err variable is passed into WithError.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9c96217-119c-4261-ac6b-5327fc226cc2

📥 Commits

Reviewing files that changed from the base of the PR and between 5de4f2b and 433918e.

📒 Files selected for processing (3)
  • internal/config/cnimgr/cnimgr.go
  • internal/config/cnimgr/cnimgr_test.go
  • internal/config/cnimgr/cnimgr_test_inject.go

c.lastError = nil
for _, watcher := range c.watchers {
watcher <- true
if c.lastError != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm I'm not sure we still want to do the GC if the plugin went down and came back up, but rather just once on startup. I think this should only happen on stopOnReady (or, maybe better, we should pull this section out and add it after we've done phase 1 but before we start phase 2, assuming they're done synchronously?)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait no we need to do gc before watchers, ignore the second option

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i ll change a name for stopOnReady, will not match in this case

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the PR description as well for posterity?

@tsorya tsorya force-pushed the continuous-cni-status-poll branch 2 times, most recently from 6f837fd to d0a8aa8 Compare April 9, 2026 02:04
@haircommander

Copy link
Copy Markdown
Member

/approve

LGTM, @cri-o/cri-o-maintainers PTAL

@openshift-ci

openshift-ci Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, tsorya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2026
@bitoku

bitoku commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 13, 2026
@bitoku

bitoku commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

@tsorya can you update the PR description to include release note? (see PR template)

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from d0a8aa8 to 451a735 Compare April 13, 2026 17:37
@openshift-ci openshift-ci Bot removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 14, 2026
@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 60857a0 to aac7feb Compare April 14, 2026 13:23
@tsorya

tsorya commented Apr 14, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

1 similar comment
@tsorya

tsorya commented Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

@bitoku

bitoku commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

/retest

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from aac7feb to 46e4be2 Compare April 15, 2026 13:37
@tsorya

tsorya commented Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@codecov

codecov Bot commented Apr 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.39394% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.40%. Comparing base (c4905cc) to head (8ff525c).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9855      +/-   ##
==========================================
- Coverage   67.57%   67.40%   -0.17%     
==========================================
  Files         214      215       +1     
  Lines       29607    29657      +50     
==========================================
- Hits        20007    19991      -16     
- Misses       7903     7962      +59     
- Partials     1697     1704       +7     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tsorya

tsorya commented Apr 16, 2026

Copy link
Copy Markdown
Contributor Author

go: go.mod requires go >= 1.26.2 (running go 1.26.0; GOTOOLCHAIN=local)
make: *** [Makefile:101: /tmp/tmp.N7UFu2Rj1s/build/bin/linux/amd64/critest] Error 1

seems not relevant to the change

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 46e4be2 to bbec2e9 Compare April 16, 2026 02:06
@tsorya

tsorya commented Apr 16, 2026

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@bitoku

bitoku commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

/retest

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from bbec2e9 to 179ae13 Compare April 16, 2026 22:35
@tsorya

tsorya commented Apr 17, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 179ae13 to 954493d Compare April 17, 2026 00:26
@bitoku

bitoku commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

@tsorya
linter failed. make lint will fix them.

CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Made-with: Cursor
@tsorya tsorya force-pushed the continuous-cni-status-poll branch from 954493d to 8ff525c Compare April 17, 2026 12:43
@tsorya

tsorya commented Apr 17, 2026

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@tsorya

tsorya commented Apr 19, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@bitoku

bitoku commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2026
@bitoku

bitoku commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

/retest

1 similar comment
@tsorya

tsorya commented Apr 20, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-merge-bot openshift-merge-bot Bot merged commit 92586a0 into cri-o:main Apr 20, 2026
61 of 74 checks passed
@haircommander

Copy link
Copy Markdown
Member

/cherry-pick release-1.35

@openshift-cherrypick-robot

Copy link
Copy Markdown

@haircommander: #9855 failed to apply on top of branch "release-1.35":

Applying: cnimgr: continuously poll CNI STATUS to detect runtime health changes
Using index info to reconstruct a base tree...
M	internal/config/cnimgr/cnimgr.go
M	internal/config/cnimgr/cnimgr_test_inject.go
M	server/server_test.go
M	server/suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging server/suite_test.go
Auto-merging server/server_test.go
Auto-merging internal/config/cnimgr/cnimgr_test_inject.go
CONFLICT (content): Merge conflict in internal/config/cnimgr/cnimgr_test_inject.go
Auto-merging internal/config/cnimgr/cnimgr.go
CONFLICT (content): Merge conflict in internal/config/cnimgr/cnimgr.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 cnimgr: continuously poll CNI STATUS to detect runtime health changes

Details

In response to this:

/cherry-pick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants