cnimgr: continuously poll CNI STATUS to detect runtime health changes by tsorya · Pull Request #9855 · cri-o/cri-o

tsorya · 2026-03-28T03:28:14Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

This PR changes the polling to two stages:

Init poll (500ms): fast poll until first ready, triggers GC and
notifies watchers. Same as before.
Monitor poll (5s): continuous health check. Sets not-ready on
failure, self-heals on recovery.

Also replaces the shutdown bool with context cancellation for clean
goroutine lifecycle, and handles edge cases in AddWatcher for shutdown
and already-ready states.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-81474

Special notes for your reviewer:

The doGC call is still made under the mutex (inherited from the
original code). This could be improved in a follow-up by splitting the
lock around the I/O-heavy GC operation.

Does this PR introduce a user-facing change?

CRI-O now continuously monitors CNI plugin health using the STATUS
verb. If a plugin becomes unhealthy after initial readiness, the node
is reported as NetworkReady=false, preventing pod scheduling on
affected nodes. The node self-heals when the plugin recovers.

openshift-ci · 2026-03-28T03:28:26Z

Hi @tsorya. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-03-28T03:28:58Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Replaced the startup poll with a cancellable two-phase continuous poll (fast readiness, slow health monitor), added errShutdown, refactored status checks into statusPollFunc, changed watcher semantics and Shutdown behavior, and added tests for readiness, flapping, GC, and watchers.

Changes

Cohort / File(s)	Summary
Core Polling Refactor `internal/config/cnimgr/cnimgr.go`	Added `errShutdown`, replaced boolean shutdown with cancellable polling context, introduced `initPollInterval`/`monitorPollInterval`, replaced `pollFunc`/`pollUntilReady` with `pollContinuously(ctx)` using `statusPollFunc(ctx, stopOnReady)`, adjusted watcher handling (non-blocking notifies, clearing), updated `Shutdown()` to cancel context and set `lastError = errShutdown`, and updated GC defer comments.
New Tests `internal/config/cnimgr/cnimgr_test.go`	New test suite with `fakeCNIPlugin`, `newTestManager`, and `TestStatusPolling` covering readiness transitions, flapping, watcher notifications (immediate and on transitions, non-blocking), GC invocation behavior, and shutdown semantics.
Test Injection Adjust `internal/config/cnimgr/cnimgr_test_inject.go`	Test-only change: `SetCNIPlugin` now invokes `c.statusPollFunc(context.Background(), false)` (with `//nolint:errcheck`) instead of calling `c.pollFunc()`; added `context` import. No public API changes.

Sequence Diagram

sequenceDiagram
    participant Client
    participant CNIManager
    participant FastPhase as Phase 1 (Fast Readiness)
    participant SlowPhase as Phase 2 (Health Monitor)
    participant Plugin
    participant GC
    participant Watchers

    Client->>CNIManager: Start polling
    CNIManager->>FastPhase: pollContinuously(ctx) -> start fast phase

    loop Fast Phase (short interval)
        FastPhase->>Plugin: StatusWithContext(ctx)
        alt healthy
            Plugin-->>FastPhase: nil
            FastPhase->>GC: doGC(ctx)
            GC-->>FastPhase: done
            FastPhase->>Watchers: send true (non-blocking)
            FastPhase->>SlowPhase: transition to slow monitoring
        else unhealthy
            Plugin-->>FastPhase: error
            FastPhase->>FastPhase: continue fast polling
        end
    end

    loop Slow Phase (long interval)
        SlowPhase->>Plugin: StatusWithContext(ctx)
        alt recovery (error -> nil)
            Plugin-->>SlowPhase: nil
            SlowPhase->>GC: doGC(ctx)
            SlowPhase->>Watchers: send true (non-blocking)
        else failure (nil -> error)
            Plugin-->>SlowPhase: error
            SlowPhase->>Watchers: send false (non-blocking)
        else unchanged
            SlowPhase->>SlowPhase: continue monitoring
        end
    end

    Client->>CNIManager: Shutdown()
    CNIManager->>CNIManager: cancel polling context
    CNIManager->>CNIManager: lastError = errShutdown
    CNIManager->>Watchers: send false (non-blocking) and clear watchers

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through short and slow patrols,

I peeked at plugins, counted holes,
When health returned I swept the floor,
I nudged the watchers, then no more,
Now quiet paws rest after roles.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: introducing continuous CNI STATUS polling to detect runtime health changes, which is the core objective of this PR.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/config/cnimgr/cnimgr_test_inject.go`:
- Around line 23-24: The injected test call passes a nil context into
c.statusPollFunc which can lead to a panic when statusPollFunc invokes doGC that
uses context.WithTimeout; update the test to pass a real context (e.g.,
context.Background() or context.TODO()) into the call to c.statusPollFunc
instead of nil so statusPollFunc and doGC receive a valid context.

In `@internal/config/cnimgr/cnimgr.go`:
- Around line 91-99: When broadcasting readiness/shutdown, treat watcher
channels as terminal: in the loop that iterates c.watchers (the non-blocking
send shown) send the signal, close each watcher channel, and then clear
c.watchers so stale waiters are not revisited; additionally modify AddWatcher()
to check c.lastError (and errShutdown) under the same lock and if set return a
channel that is already signaled/closed instead of appending a new watcher to
c.watchers. This ensures Shutdown()/readiness broadcasts fully terminate
watchers and prevents callers from being appended to a canceled poller and
blocking forever.
- Around line 54-60: In pollContinuously, statusPollFunc is calling
c.plugin.Status() without the request context which can block while holding the
mutex; update statusPollFunc (used in pollContinuously) to call
c.plugin.StatusWithContext(ctx) and pass the provided ctx into that call (and do
the same for the second-phase polling lambda), so status checks honor
cancellation and do not stall ReadyOrError, AddWatcher, or Shutdown; locate the
statusPollFunc closures referenced in pollContinuously and replace Status() with
StatusWithContext(ctx).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c53c07ed-0ac5-43e2-8c89-a60320631769

📥 Commits

Reviewing files that changed from the base of the PR and between 5753c52 and 4cb7a9b.

📒 Files selected for processing (3)

internal/config/cnimgr/cnimgr.go
internal/config/cnimgr/cnimgr_test.go
internal/config/cnimgr/cnimgr_test_inject.go

coderabbitai

♻️ Duplicate comments (2)

internal/config/cnimgr/cnimgr.go (2)

94-99: ⚠️ Potential issue | 🔴 Critical

Treat watcher channels as terminal after readiness/shutdown.

AddWatcher() currently appends when lastError != nil, including errShutdown (Line 127), so callers can get parked forever after shutdown. Also, watcher lists are not cleared after notifications (Lines 94-99 and 150-155), so stale channels are revisited repeatedly.

Suggested fix

 func (c *CNIManager) AddWatcher() chan bool {
 	c.mutex.Lock()
 	defer c.mutex.Unlock()

 	watcher := make(chan bool, 1)
-	if c.lastError == nil {
+	switch {
+	case c.lastError == nil:
 		watcher <- true
-	} else {
+		close(watcher)
+	case errors.Is(c.lastError, errShutdown):
+		watcher <- false
+		close(watcher)
+	default:
 		c.watchers = append(c.watchers, watcher)
 	}

 	return watcher
 }

 		for _, watcher := range c.watchers {
 			select {
 			case watcher <- true:
 			default:
 			}
+			close(watcher)
 		}
+		c.watchers = nil
 	}

 	for _, watcher := range c.watchers {
 		select {
 		case watcher <- false:
 		default:
 		}
+		close(watcher)
 	}
+	c.watchers = nil
 }

Also applies to: 122-133, 136-155

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 94 - 99, The watcher channels
are treated as persistent but must be terminal: update AddWatcher() to NOT
append new watchers when c.lastError != nil (including errShutdown) so callers
don't get parked after shutdown, and change the notification logic in the loop
that iterates c.watchers (and the similar block around lines 136-155) to treat
channels as one‑time: send the readiness/shutdown signal non-blockingly to each
watcher (select { case watcher <- true: default: }), then remove/clear that
watcher from the slice (or set c.watchers = nil) so stale channels are not
revisited; ensure this uses the existing symbols AddWatcher, c.watchers,
c.lastError and errShutdown so behavior is consistent across readiness and
shutdown paths.

72-76: ⚠️ Potential issue | 🟠 Major

Use context-aware CNI status checks (and avoid status call under the lock).

At Line 76, c.plugin.Status() ignores the provided ctx. If status blocks, it can hold the write lock longer than necessary and stall ReadyOrError(), AddWatcher(), and Shutdown() paths.

Suggested fix

 func (c *CNIManager) statusPollFunc(ctx context.Context, stopOnReady bool) (bool, error) {
-	c.mutex.Lock()
-	defer c.mutex.Unlock()
-
-	if err := c.plugin.Status(); err != nil {
+	if err := c.plugin.StatusWithContext(ctx); err != nil {
+		c.mutex.Lock()
 		c.lastError = err
+		c.mutex.Unlock()
 		return false, nil
 	}
 
+	c.mutex.Lock()
+	defer c.mutex.Unlock()
+
 	if c.lastError != nil {
 		c.lastError = nil

#!/bin/bash
set -euo pipefail
# Verify ocicni interface methods and current cnimgr usage.
rg -n -C2 'type CNIPlugin interface|StatusWithContext\(|Status\(' \
  vendor/github.com/cri-o/ocicni/pkg/ocicni/types.go \
  internal/config/cnimgr/cnimgr.go

As per coding guidelines, "Propagate context.Context through function calls in Go code".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 76, The statusPollFunc
currently calls c.plugin.Status() while holding c.mutex and without passing ctx;
change it to avoid calling the blocking status under the lock and to use a
context-aware API: grab a local reference to the plugin (e.g., p := c.plugin)
while holding c.mutex, release the lock, then call the context-aware status
method (preferably p.StatusWithContext(ctx) if available, or add/use a
StatusWithContext wrapper on the CNI plugin) and only re-acquire the lock to
update manager state based on the result; update statusPollFunc, and ensure
callers like ReadyOrError(), AddWatcher(), and Shutdown() are not blocked by the
status call.

🧹 Nitpick comments (1)

internal/config/cnimgr/cnimgr.go (1)

88-90: Prefer structured logrus fields for GC error logging.

At Line 89, use structured logging instead of Warnf so error data is consistently queryable.

Suggested refactor

-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")

As per coding guidelines, "Use logrus with structured fields for logging in Go code".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, Replace the
unstructured Warnf call that logs GC errors with a structured logrus entry: when
c.doGC(ctx) returns an error, call logrus.WithError(err).Warn("Garbage collect
stale network resources failed") (or use
logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC failure
is logged using structured fields; update the block that currently uses
logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message
text.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 94-99: The watcher channels are treated as persistent but must be
terminal: update AddWatcher() to NOT append new watchers when c.lastError != nil
(including errShutdown) so callers don't get parked after shutdown, and change
the notification logic in the loop that iterates c.watchers (and the similar
block around lines 136-155) to treat channels as one‑time: send the
readiness/shutdown signal non-blockingly to each watcher (select { case watcher
<- true: default: }), then remove/clear that watcher from the slice (or set
c.watchers = nil) so stale channels are not revisited; ensure this uses the
existing symbols AddWatcher, c.watchers, c.lastError and errShutdown so behavior
is consistent across readiness and shutdown paths.
- Around line 72-76: The statusPollFunc currently calls c.plugin.Status() while
holding c.mutex and without passing ctx; change it to avoid calling the blocking
status under the lock and to use a context-aware API: grab a local reference to
the plugin (e.g., p := c.plugin) while holding c.mutex, release the lock, then
call the context-aware status method (preferably p.StatusWithContext(ctx) if
available, or add/use a StatusWithContext wrapper on the CNI plugin) and only
re-acquire the lock to update manager state based on the result; update
statusPollFunc, and ensure callers like ReadyOrError(), AddWatcher(), and
Shutdown() are not blocked by the status call.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 88-90: Replace the unstructured Warnf call that logs GC errors
with a structured logrus entry: when c.doGC(ctx) returns an error, call
logrus.WithError(err).Warn("Garbage collect stale network resources failed") (or
use logrus.WithFields(logrus.Fields{"error": err}).Warn(...)) so the c.doGC
failure is logged using structured fields; update the block that currently uses
logrus.Warnf to use logrus.WithError(err).Warn and keep the existing message
text.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bab20aa6-07ff-467c-98dd-d8b5fc57e3bd

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb7a9b and 09adc49.

📒 Files selected for processing (3)

internal/config/cnimgr/cnimgr.go
internal/config/cnimgr/cnimgr_test.go
internal/config/cnimgr/cnimgr_test_inject.go

✅ Files skipped from review due to trivial changes (1)

internal/config/cnimgr/cnimgr_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/config/cnimgr/cnimgr_test_inject.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

internal/config/cnimgr/cnimgr.go (1)
72-100: ⚠️ Potential issue | 🟠 Major

Follow-up: external CNI work still runs under c.mutex.

statusPollFunc keeps the write lock held across StatusWithContext, validPodList(), and plugin.GC(). If any of those block, ReadyOrError() on the kubelet status path (server/runtime_status.go:1-20) and watcher registration in server/sandbox_network.go:367-378 stall behind the poller. Please narrow the critical section to state transitions and watcher handoff, and run the external work outside the mutex.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 72 - 100, statusPollFunc
currently holds c.mutex while calling external/blocking work
(plugin.StatusWithContext, validPodList/doGC), causing other paths to stall;
change it so the mutex is only held to read/update internal state and to hand
off watchers. Concretely: acquire c.mutex to snapshot any required state (e.g.
c.lastError, copy of c.watchers) and then release the mutex before calling
plugin.StatusWithContext, validPodList(), or doGC; perform those calls and
compute the new state outside the lock, then re-acquire c.mutex to update
c.lastError, clear/replace c.watchers and perform the non-blocking watcher
sends. Ensure you use local variables for results and only mutate CNIManager
fields (c.lastError, c.watchers) while holding c.mutex to keep transitions
atomic.

🧹 Nitpick comments (2)

internal/config/cnimgr/cnimgr.go (2)
41-49: Pass a parent context into the poller.

The manager now owns a long-lived goroutine, but it is rooted in context.Background(), so its lifetime is detached from the caller and Shutdown() becomes the only cancellation path. Please accept a parent context.Context here and derive the internal cancelable context from that instead.

As per coding guidelines, "Propagate context.Context through function calls in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 41 - 49, The poller is created
from context.Background() so its lifetime is detached; change the constructor
that creates CNIManager to accept a parent context (e.g. ctx context.Context)
and derive the cancellable context with context.WithCancel(parentCtx) instead of
context.Background(), then pass that derived ctx into mgr.pollContinuously(ctx);
keep the existing mgr.cancel and Shutdown() behavior so Shutdown still calls
mgr.cancel() but ensure callers supply the parent context when creating the
CNIManager (update the function signature that instantiates CNIManager
accordingly).
88-90: Log GC failures as structured fields.

Please keep the error as a field here rather than formatting it into the message.
📝 Suggested change
-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")
As per coding guidelines, "Use logrus with structured fields for logging in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 88 - 90, The log currently
formats the error into the message using logrus.Warnf after calling c.doGC(ctx);
change it to use structured logging so the error is attached as a field (e.g.,
use logrus.WithError(err).Warn(...) or logrus.WithField("err", err).Warn(...))
and keep the original descriptive text "Garbage collect stale network resources
failed" as the message; update the error handling where c.doGC(ctx) is called
and replace the logrus.Warnf invocation accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager currently starts the immediate Phase 1 poll
by launching go mgr.pollContinuously(ctx) which races tests that set
mgr.validPodList; change newTestManager to construct and return the CNIManager
without starting the goroutine, add a small Start or StartPolling method (e.g.,
CNIManager.Start(ctx) or CNIManager.StartPolling) that calls pollContinuously,
and update tests to set mgr.validPodList before calling Start; refer to
newTestManager, CNIManager.pollContinuously, Start/StartPolling (new) and
validPodList when making the change.

---

Duplicate comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 72-100: statusPollFunc currently holds c.mutex while calling
external/blocking work (plugin.StatusWithContext, validPodList/doGC), causing
other paths to stall; change it so the mutex is only held to read/update
internal state and to hand off watchers. Concretely: acquire c.mutex to snapshot
any required state (e.g. c.lastError, copy of c.watchers) and then release the
mutex before calling plugin.StatusWithContext, validPodList(), or doGC; perform
those calls and compute the new state outside the lock, then re-acquire c.mutex
to update c.lastError, clear/replace c.watchers and perform the non-blocking
watcher sends. Ensure you use local variables for results and only mutate
CNIManager fields (c.lastError, c.watchers) while holding c.mutex to keep
transitions atomic.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 41-49: The poller is created from context.Background() so its
lifetime is detached; change the constructor that creates CNIManager to accept a
parent context (e.g. ctx context.Context) and derive the cancellable context
with context.WithCancel(parentCtx) instead of context.Background(), then pass
that derived ctx into mgr.pollContinuously(ctx); keep the existing mgr.cancel
and Shutdown() behavior so Shutdown still calls mgr.cancel() but ensure callers
supply the parent context when creating the CNIManager (update the function
signature that instantiates CNIManager accordingly).
- Around line 88-90: The log currently formats the error into the message using
logrus.Warnf after calling c.doGC(ctx); change it to use structured logging so
the error is attached as a field (e.g., use logrus.WithError(err).Warn(...) or
logrus.WithField("err", err).Warn(...)) and keep the original descriptive text
"Garbage collect stale network resources failed" as the message; update the
error handling where c.doGC(ctx) is called and replace the logrus.Warnf
invocation accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 84e0db6d-36dd-476e-a3a3-e4b7fe52880a

📥 Commits

Reviewing files that changed from the base of the PR and between 09adc49 and a73153a.

📒 Files selected for processing (3)

internal/config/cnimgr/cnimgr.go
internal/config/cnimgr/cnimgr_test.go
internal/config/cnimgr/cnimgr_test_inject.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/config/cnimgr/cnimgr_test_inject.go

coderabbitai

♻️ Duplicate comments (1)

internal/config/cnimgr/cnimgr_test.go (1)

62-72: ⚠️ Potential issue | 🟡 Minor

newTestManager still races startup readiness in GC tests.

Because polling starts immediately, subtests that assign mgr.validPodList after construction can miss the first ready transition and intermittently skip the GC call they assert on.

💡 Deterministic helper pattern

-func newTestManager(plugin *fakeCNIPlugin) *CNIManager {
+func newTestManager(plugin *fakeCNIPlugin, preStart func(*CNIManager)) *CNIManager {
 	ctx, cancel := context.WithCancel(context.Background())
 	mgr := &CNIManager{
 		plugin:         plugin,
 		lastError:      errors.New("plugin status uninitialized"),
 		cancel:         cancel,
 		initPollInterval:    testPollInterval,
 		monitorPollInterval: testPollInterval,
 	}
+	if preStart != nil {
+		preStart(mgr)
+	}
 	go mgr.pollContinuously(ctx)
 	return mgr
}

Then GC tests can install validPodList in preStart before polling begins.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr_test.go` around lines 62 - 72, newTestManager
starts the pollContinuously goroutine immediately which races tests that want to
set mgr.validPodList before polling begins; change newTestManager to not start
polling itself (remove the go mgr.pollContinuously(ctx) call) and instead
provide a deterministic start hook (e.g., add a Start/BeginPolling method or
expose PollContinuouslyStart(ctx) that calls mgr.pollContinuously) so tests can
set mgr.validPodList in their preStart setup and then invoke mgr.Start(ctx) to
begin polling; update tests to call the new start method after installing
validPodList and keep references to pollContinuously and validPodList to locate
the affected code.

🧹 Nitpick comments (1)

internal/config/cnimgr/cnimgr.go (1)

89-90: Use structured logrus fields for error logging.

Replace logrus.Warnf() with logrus.WithError(err).Warn() to align with structured logging conventions and enable better queryability in logs.

Suggested change

-		if err := c.doGC(ctx); err != nil {
-			logrus.Warnf("Garbage collect stale network resources failed: %v", err)
-		}
+		if err := c.doGC(ctx); err != nil {
+			logrus.WithError(err).Warn("Garbage collect stale network resources failed")
+		}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/config/cnimgr/cnimgr.go` around lines 89 - 90, Replace the
unstructured log call logrus.Warnf("Garbage collect stale network resources
failed: %v", err) with a structured log using the error field: use
logrus.WithError(err).Warn("Garbage collect stale network resources failed") so
the same message is logged but the error is attached as structured data; update
the occurrence where logrus.Warnf is called and ensure the err variable is
passed into WithError.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/config/cnimgr/cnimgr_test.go`:
- Around line 62-72: newTestManager starts the pollContinuously goroutine
immediately which races tests that want to set mgr.validPodList before polling
begins; change newTestManager to not start polling itself (remove the go
mgr.pollContinuously(ctx) call) and instead provide a deterministic start hook
(e.g., add a Start/BeginPolling method or expose PollContinuouslyStart(ctx) that
calls mgr.pollContinuously) so tests can set mgr.validPodList in their preStart
setup and then invoke mgr.Start(ctx) to begin polling; update tests to call the
new start method after installing validPodList and keep references to
pollContinuously and validPodList to locate the affected code.

---

Nitpick comments:
In `@internal/config/cnimgr/cnimgr.go`:
- Around line 89-90: Replace the unstructured log call logrus.Warnf("Garbage
collect stale network resources failed: %v", err) with a structured log using
the error field: use logrus.WithError(err).Warn("Garbage collect stale network
resources failed") so the same message is logged but the error is attached as
structured data; update the occurrence where logrus.Warnf is called and ensure
the err variable is passed into WithError.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9c96217-119c-4261-ac6b-5327fc226cc2

📥 Commits

Reviewing files that changed from the base of the PR and between 5de4f2b and 433918e.

📒 Files selected for processing (3)

internal/config/cnimgr/cnimgr.go
internal/config/cnimgr/cnimgr_test.go
internal/config/cnimgr/cnimgr_test_inject.go

haircommander · 2026-04-08T18:53:52Z

-	c.lastError = nil
-	for _, watcher := range c.watchers {
-		watcher <- true
+	if c.lastError != nil {


hm I'm not sure we still want to do the GC if the plugin went down and came back up, but rather just once on startup. I think this should only happen on stopOnReady (or, maybe better, we should pull this section out and add it after we've done phase 1 but before we start phase 2, assuming they're done synchronously?)

oh wait no we need to do gc before watchers, ignore the second option

sure, i ll change a name for stopOnReady, will not match in this case

Can you update the PR description as well for posterity?

haircommander · 2026-04-10T19:22:57Z

/approve

LGTM, @cri-o/cri-o-maintainers PTAL

openshift-ci · 2026-04-10T19:23:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, tsorya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bitoku · 2026-04-13T12:32:06Z

/ok-to-test

bitoku · 2026-04-13T12:34:23Z

@tsorya can you update the PR description to include release note? (see PR template)

tsorya · 2026-04-14T16:23:15Z

/retest-required

tsorya · 2026-04-15T02:10:24Z

/retest-required

bitoku · 2026-04-15T02:40:57Z

/retest

tsorya · 2026-04-15T14:35:23Z

/retest

codecov · 2026-04-15T15:04:29Z

Codecov Report

❌ Patch coverage is 89.39394% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.40%. Comparing base (c4905cc) to head (8ff525c).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9855      +/-   ##
==========================================
- Coverage   67.57%   67.40%   -0.17%     
==========================================
  Files         214      215       +1     
  Lines       29607    29657      +50     
==========================================
- Hits        20007    19991      -16     
- Misses       7903     7962      +59     
- Partials     1697     1704       +7

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

tsorya · 2026-04-16T02:04:29Z

go: go.mod requires go >= 1.26.2 (running go 1.26.0; GOTOOLCHAIN=local)
make: *** [Makefile:101: /tmp/tmp.N7UFu2Rj1s/build/bin/linux/amd64/critest] Error 1

seems not relevant to the change

tsorya · 2026-04-16T03:27:30Z

/retest

bitoku · 2026-04-16T15:39:52Z

/retest

tsorya · 2026-04-17T00:20:47Z

/retest

bitoku · 2026-04-17T10:32:17Z

@tsorya
linter failed. make lint will fix them.

CRI-O polls CNI Status() at startup to wait for readiness, but stops after the first success. With the CNI STATUS verb (spec v1.1.0), plugins can now report unhealthy at runtime, but CRI-O doesn't react to this verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon restart, config regeneration, or network failure), CRI-O keeps reporting NetworkReady=true and pods fail at network setup instead of the node being marked not-ready. This is analogous to containerd, which calls Status() on every CRI Status request (see containerd/containerd#11135). Change the polling to two stages: - Init poll (500ms): fast poll until first ready, triggers GC and notifies watchers. Same as before. - Monitor poll (5s): continuous health check. Sets not-ready on failure, self-heals on recovery including GC for stale resources. Also replace the shutdown bool with context cancellation for clean goroutine lifecycle, and handle edge cases in AddWatcher for shutdown and already-ready states. Signed-off-by: Igal Tsoiref <itsoiref@redhat.com> Made-with: Cursor

tsorya · 2026-04-17T17:51:37Z

/retest

tsorya · 2026-04-19T01:46:13Z

/retest

bitoku · 2026-04-19T12:32:28Z

/lgtm

bitoku · 2026-04-19T12:32:34Z

/retest

tsorya · 2026-04-20T00:22:02Z

/retest

haircommander · 2026-04-23T14:41:04Z

/cherry-pick release-1.35

openshift-cherrypick-robot · 2026-04-23T14:42:05Z

@haircommander: #9855 failed to apply on top of branch "release-1.35":

Applying: cnimgr: continuously poll CNI STATUS to detect runtime health changes
Using index info to reconstruct a base tree...
M	internal/config/cnimgr/cnimgr.go
M	internal/config/cnimgr/cnimgr_test_inject.go
M	server/server_test.go
M	server/suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging server/suite_test.go
Auto-merging server/server_test.go
Auto-merging internal/config/cnimgr/cnimgr_test_inject.go
CONFLICT (content): Merge conflict in internal/config/cnimgr/cnimgr_test_inject.go
Auto-merging internal/config/cnimgr/cnimgr.go
CONFLICT (content): Merge conflict in internal/config/cnimgr/cnimgr.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 cnimgr: continuously poll CNI STATUS to detect runtime health changes

Details

In response to this:

/cherry-pick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tsorya requested a review from mrunalp as a code owner March 28, 2026 03:28

openshift-ci Bot requested review from bitoku and littlejawa March 28, 2026 03:28

tsorya force-pushed the continuous-cni-status-poll branch from 4cb7a9b to 09adc49 Compare March 28, 2026 03:33

coderabbitai Bot reviewed Mar 28, 2026

View reviewed changes

Comment thread internal/config/cnimgr/cnimgr_test_inject.go Outdated

Comment thread internal/config/cnimgr/cnimgr.go

Comment thread internal/config/cnimgr/cnimgr.go

coderabbitai Bot reviewed Mar 28, 2026

View reviewed changes

tsorya force-pushed the continuous-cni-status-poll branch from 09adc49 to a73153a Compare March 29, 2026 03:43

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

Comment thread internal/config/cnimgr/cnimgr_test.go

tsorya force-pushed the continuous-cni-status-poll branch 5 times, most recently from 433918e to 5eb101c Compare April 7, 2026 21:38

coderabbitai Bot reviewed Apr 7, 2026

View reviewed changes

haircommander reviewed Apr 8, 2026

View reviewed changes

tsorya force-pushed the continuous-cni-status-poll branch 2 times, most recently from 6f837fd to d0a8aa8 Compare April 9, 2026 02:04

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2026

openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 13, 2026

tsorya force-pushed the continuous-cni-status-poll branch from d0a8aa8 to 451a735 Compare April 13, 2026 17:37

openshift-ci Bot removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 14, 2026

tsorya force-pushed the continuous-cni-status-poll branch from 60857a0 to aac7feb Compare April 14, 2026 13:23

tsorya force-pushed the continuous-cni-status-poll branch from aac7feb to 46e4be2 Compare April 15, 2026 13:37

tsorya force-pushed the continuous-cni-status-poll branch from 46e4be2 to bbec2e9 Compare April 16, 2026 02:06

tsorya force-pushed the continuous-cni-status-poll branch from bbec2e9 to 179ae13 Compare April 16, 2026 22:35

tsorya force-pushed the continuous-cni-status-poll branch from 179ae13 to 954493d Compare April 17, 2026 00:26

tsorya force-pushed the continuous-cni-status-poll branch from 954493d to 8ff525c Compare April 17, 2026 12:43

openshift-ci Bot assigned bitoku Apr 19, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2026

openshift-merge-bot Bot merged commit 92586a0 into cri-o:main Apr 20, 2026
61 of 74 checks passed

haircommander mentioned this pull request Apr 23, 2026

OCPBUGS-81474: [1.35] cnimgr: continuously poll CNI STATUS to detect runtime health changes #9903

Merged

coderabbitai Bot mentioned this pull request May 1, 2026

TRT-2646: gate continuous CNI STATUS monitoring behind config flag and provide a tolerance window #9918

Open

Conversation

tsorya commented Mar 28, 2026 • edited by bitoku Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

openshift-ci Bot commented Mar 28, 2026

Uh oh!

coderabbitai Bot commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

haircommander Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

haircommander Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

tsorya Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

tsorya Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

bitoku Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

haircommander commented Apr 10, 2026

Uh oh!

openshift-ci Bot commented Apr 10, 2026

Uh oh!

bitoku commented Apr 13, 2026

Uh oh!

bitoku commented Apr 13, 2026

Uh oh!

tsorya commented Apr 14, 2026

Uh oh!

tsorya commented Apr 15, 2026

Uh oh!

bitoku commented Apr 15, 2026

Uh oh!

tsorya commented Apr 15, 2026

Uh oh!

codecov Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tsorya commented Apr 16, 2026

Uh oh!

tsorya commented Apr 16, 2026

Uh oh!

bitoku commented Apr 16, 2026

Uh oh!

tsorya commented Apr 17, 2026

Uh oh!

tsorya commented Mar 28, 2026 •

edited by bitoku

Loading

coderabbitai Bot commented Mar 28, 2026 •

edited

Loading

codecov Bot commented Apr 15, 2026 •

edited

Loading