Skip to content

rdma: fix flaky tests that assume hardcoded device name#1169

Merged
aboch merged 1 commit intovishvananda:mainfrom
booxter:fix-rdma-flaky-tests
Mar 10, 2026
Merged

rdma: fix flaky tests that assume hardcoded device name#1169
aboch merged 1 commit intovishvananda:mainfrom
booxter:fix-rdma-flaky-tests

Conversation

@booxter
Copy link
Copy Markdown
Contributor

@booxter booxter commented Feb 25, 2026

TestRdmaGetRdmaLink, TestRdmaSetRdmaLinkName, and TestRdmaLinkSetNsFd
hardcode the RDMA device name "foo". When ib_core is loaded but no
device named "foo" exists, these tests fail instead of skipping.

Replace the hardcoded name with a helper that lists available RDMA
devices and uses the first one, or skips the test if none are found.

Also skip TestRdmaLinkSetNsFd when switching to exclusive netns mode
fails with "device or resource busy", which happens when RDMA devices
are actively in use on the CI runner.

Signed-off-by: Ihar Hrachyshka ihrachyshka@nvidia.com
Assisted-By: Claude Opus 4.6 noreply@anthropic.com

Summary by CodeRabbit

  • Tests
    • Improved test reliability by dynamically detecting RDMA devices instead of relying on fixed names.
    • Tests now preserve and restore original device names and gracefully skip when no devices or netns setup are available, reducing false failures.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Replace hard-coded RDMA device name usages in tests with a helper that selects the first available RDMA device; store original names when renaming; skip tests if no RDMA devices are found; adjust netns-related flows and error handling to use the detected device name.

Changes

Cohort / File(s) Summary
RDMA Link Test Refactor
rdma_link_test.go
Add helper to obtain first RDMA device and original name; replace RdmaLinkByName("foo") calls with helper-provided name; update rename/revert logic to use captured original name; convert a fatal netns setup error to a test skip when appropriate; add skip guard if no RDMA devices present.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through tests to find the one,
I sniffed the links till day was done.
No "foo" now — the first device is known,
I rename, revert, and skip when none. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing flaky tests that rely on a hardcoded RDMA device name by replacing it with dynamic device detection.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

TestRdmaGetRdmaLink, TestRdmaSetRdmaLinkName, and TestRdmaLinkSetNsFd
hardcode the RDMA device name "foo". When ib_core is loaded but no
device named "foo" exists, these tests fail instead of skipping.

Replace the hardcoded name with a helper that lists available RDMA
devices and uses the first one, or skips the test if none are found.

Also skip TestRdmaLinkSetNsFd when switching to exclusive netns mode
fails with "device or resource busy", which happens when RDMA devices
are actively in use on the CI runner.

Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rdma_link_test.go (1)

117-160: ⚠️ Potential issue | 🟠 Major

firstRdmaDevice called after state mutation leaks newns and RDMA mode — likely cause of the CI failure.

By line 144, two mutations are already in effect:

  1. RDMA system mode has been set to "exclusive" (line 126).
  2. newns has been created (line 137) but is not deferred for close.

If firstRdmaDevice(t) calls t.Skip() or t.Fatal() (e.g., no RDMA devices in the CI environment), runtime.Goexit() runs only deferred functions. At that point only defer basens.Close() is registered, so:

  • newns fd is leaked.
  • The system RDMA mode remains "exclusive", potentially corrupting subsequent tests.

The 0.00s test duration in the CI failure is consistent with an early skip/bail at this point.

Move firstRdmaDevice(t) to the top of the test before any state change, or add explicit defers for cleanup before it is called:

🛠️ Proposed fix — hoist `firstRdmaDevice` before any state mutation
 func TestRdmaLinkSetNsFd(t *testing.T) {
 	minKernelRequired(t, 5, 2)
 	setupRdmaKModule(t, "ib_core")
+	link := firstRdmaDevice(t)
 
 	mode, err := RdmaSystemGetNetnsMode()
 	if err != nil {
 		t.Fatal(err)
 	}
 	t.Log("current rdma netns mode", mode)
 	err = RdmaSystemSetNetnsMode("exclusive")
 	if err != nil {
 		t.Fatal(err)
 	}
 	basens, err := netns.Get()
 	if err != nil {
 		RdmaSystemSetNetnsMode(mode)
 		t.Fatal("Failed to get basens")
 	}
 	defer basens.Close()
 
 	newns, err := netns.New()
 	if err != nil {
 		RdmaSystemSetNetnsMode(mode)
 		t.Fatal("Failed to create newns")
 	}
+	defer newns.Close()
 
 	netns.Set(basens)
-	link := firstRdmaDevice(t)
 	t.Log("rdma link: ", link)
 
 	err = RdmaLinkSetNsFd(link, uint32(newns))
 	if err != nil {
-		newns.Close()
 		RdmaSystemSetNetnsMode(mode)
 		t.Fatal(err)
 	}
 
-	newns.Close()
 	//Set the old mode back at start of the test
 	err = RdmaSystemSetNetnsMode(mode)
 	if err != nil {
 		t.Fatal(err)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rdma_link_test.go` around lines 117 - 160, The test TestRdmaLinkSetNsFd
currently mutates global state (calls RdmaSystemSetNetnsMode("exclusive") and
creates newns) before calling firstRdmaDevice, which can t.Skip/t.Fatal and leak
newns and leave RDMA mode changed; move the call to firstRdmaDevice(t) to the
top of TestRdmaLinkSetNsFd before any calls to RdmaSystemSetNetnsMode or
netns.New, or if you must keep the order, immediately register cleanup defers
(defer newns.Close() and defer RdmaSystemSetNetnsMode(mode)) as soon as newns is
created and mode changed so that any early t.Skip/t.Fatal still runs the
cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@rdma_link_test.go`:
- Around line 117-160: The test TestRdmaLinkSetNsFd currently mutates global
state (calls RdmaSystemSetNetnsMode("exclusive") and creates newns) before
calling firstRdmaDevice, which can t.Skip/t.Fatal and leak newns and leave RDMA
mode changed; move the call to firstRdmaDevice(t) to the top of
TestRdmaLinkSetNsFd before any calls to RdmaSystemSetNetnsMode or netns.New, or
if you must keep the order, immediately register cleanup defers (defer
newns.Close() and defer RdmaSystemSetNetnsMode(mode)) as soon as newns is
created and mode changed so that any early t.Skip/t.Fatal still runs the
cleanup.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7039a4 and cff5c33.

📒 Files selected for processing (1)
  • rdma_link_test.go

@booxter booxter force-pushed the fix-rdma-flaky-tests branch from cff5c33 to 2a75592 Compare February 25, 2026 15:41
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rdma_link_test.go (1)

144-160: ⚠️ Potential issue | 🟠 Major

RdmaSystemSetNetnsMode(mode) cleanup skipped if firstRdmaDevice exits the test.

firstRdmaDevice(t) at line 144 may call t.Skip() (no devices) or t.Fatal() (list error). Both exit via runtime.Goexit(), which runs deferred calls but skips non-deferred code. Since the mode revert at line 156 is in the normal control flow and not deferred, the system is left in "exclusive" RDMA netns mode if firstRdmaDevice aborts the test at this point.

The fix is to defer the mode revert immediately after it is changed (line 126):

🛡️ Proposed fix
 	err = RdmaSystemSetNetnsMode("exclusive")
 	if err != nil {
 		t.Skipf("Failed to set RDMA netns mode to exclusive: %v", err)
 	}
+	defer RdmaSystemSetNetnsMode(mode)
 	basens, err := netns.Get()
 	if err != nil {
-		RdmaSystemSetNetnsMode(mode)
 		t.Fatal("Failed to get basens")
 	}
 	defer basens.Close()

 	newns, err := netns.New()
 	if err != nil {
-		RdmaSystemSetNetnsMode(mode)
 		t.Fatal("Failed to create newns")
 	}

 	netns.Set(basens)
 	link := firstRdmaDevice(t)
 	t.Log("rdma link: ", link)

 	err = RdmaLinkSetNsFd(link, uint32(newns))
 	if err != nil {
 		newns.Close()
-		RdmaSystemSetNetnsMode(mode)
 		t.Fatal(err)
 	}

 	newns.Close()
-	//Set the old mode back at start of the test
-	err = RdmaSystemSetNetnsMode(mode)
-	if err != nil {
-		t.Fatal(err)
-	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rdma_link_test.go` around lines 144 - 160, The test currently sets the RDMA
netns mode (via RdmaSystemSetNetnsMode) but later reverts it only in normal
control flow, so if firstRdmaDevice(t) calls t.Skip()/t.Fatal() the revert is
skipped; fix by deferring the revert immediately after changing the mode: right
after the call that switches mode (the earlier RdmaSystemSetNetnsMode call where
you capture the original mode in variable mode), add a defer that calls
RdmaSystemSetNetnsMode(mode) (handle/log any error inside the defer rather than
using t.Fatal), and similarly defer newns.Close() as soon as newns is created so
cleanup always runs even if firstRdmaDevice or subsequent calls like
RdmaLinkSetNsFd or RdmaLinkSetNsFd(link, uint32(newns)) cause the test to exit
early.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@rdma_link_test.go`:
- Around line 144-160: The test currently sets the RDMA netns mode (via
RdmaSystemSetNetnsMode) but later reverts it only in normal control flow, so if
firstRdmaDevice(t) calls t.Skip()/t.Fatal() the revert is skipped; fix by
deferring the revert immediately after changing the mode: right after the call
that switches mode (the earlier RdmaSystemSetNetnsMode call where you capture
the original mode in variable mode), add a defer that calls
RdmaSystemSetNetnsMode(mode) (handle/log any error inside the defer rather than
using t.Fatal), and similarly defer newns.Close() as soon as newns is created so
cleanup always runs even if firstRdmaDevice or subsequent calls like
RdmaLinkSetNsFd or RdmaLinkSetNsFd(link, uint32(newns)) cause the test to exit
early.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cff5c33 and 2a75592.

📒 Files selected for processing (1)
  • rdma_link_test.go

@jellonek
Copy link
Copy Markdown
Contributor

lgtm.

@vishvananda @aboch can you remind me why we have in this lib such names like Rdma instead of RDMA?
That's afaik against golang community naming rules as described in https://go.dev/wiki/CodeReviewComments#initialisms

I guess that we had some agreement on that several years ago, but tbh - i have already lost that memory...

@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Mar 10, 2026

LGTM

@aboch aboch merged commit c909b3a into vishvananda:main Mar 10, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants