rdma: fix flaky tests that assume hardcoded device name#1169
rdma: fix flaky tests that assume hardcoded device name#1169aboch merged 1 commit intovishvananda:mainfrom
Conversation
📝 WalkthroughWalkthroughReplace hard-coded RDMA device name usages in tests with a helper that selects the first available RDMA device; store original names when renaming; skip tests if no RDMA devices are found; adjust netns-related flows and error handling to use the detected device name. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TestRdmaGetRdmaLink, TestRdmaSetRdmaLinkName, and TestRdmaLinkSetNsFd hardcode the RDMA device name "foo". When ib_core is loaded but no device named "foo" exists, these tests fail instead of skipping. Replace the hardcoded name with a helper that lists available RDMA devices and uses the first one, or skips the test if none are found. Also skip TestRdmaLinkSetNsFd when switching to exclusive netns mode fails with "device or resource busy", which happens when RDMA devices are actively in use on the CI runner. Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com> Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rdma_link_test.go (1)
117-160:⚠️ Potential issue | 🟠 Major
firstRdmaDevicecalled after state mutation leaksnewnsand RDMA mode — likely cause of the CI failure.By line 144, two mutations are already in effect:
- RDMA system mode has been set to
"exclusive"(line 126).newnshas been created (line 137) but is not deferred for close.If
firstRdmaDevice(t)callst.Skip()ort.Fatal()(e.g., no RDMA devices in the CI environment),runtime.Goexit()runs only deferred functions. At that point onlydefer basens.Close()is registered, so:
newnsfd is leaked.- The system RDMA mode remains
"exclusive", potentially corrupting subsequent tests.The 0.00s test duration in the CI failure is consistent with an early skip/bail at this point.
Move
firstRdmaDevice(t)to the top of the test before any state change, or add explicit defers for cleanup before it is called:🛠️ Proposed fix — hoist `firstRdmaDevice` before any state mutation
func TestRdmaLinkSetNsFd(t *testing.T) { minKernelRequired(t, 5, 2) setupRdmaKModule(t, "ib_core") + link := firstRdmaDevice(t) mode, err := RdmaSystemGetNetnsMode() if err != nil { t.Fatal(err) } t.Log("current rdma netns mode", mode) err = RdmaSystemSetNetnsMode("exclusive") if err != nil { t.Fatal(err) } basens, err := netns.Get() if err != nil { RdmaSystemSetNetnsMode(mode) t.Fatal("Failed to get basens") } defer basens.Close() newns, err := netns.New() if err != nil { RdmaSystemSetNetnsMode(mode) t.Fatal("Failed to create newns") } + defer newns.Close() netns.Set(basens) - link := firstRdmaDevice(t) t.Log("rdma link: ", link) err = RdmaLinkSetNsFd(link, uint32(newns)) if err != nil { - newns.Close() RdmaSystemSetNetnsMode(mode) t.Fatal(err) } - newns.Close() //Set the old mode back at start of the test err = RdmaSystemSetNetnsMode(mode) if err != nil { t.Fatal(err) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rdma_link_test.go` around lines 117 - 160, The test TestRdmaLinkSetNsFd currently mutates global state (calls RdmaSystemSetNetnsMode("exclusive") and creates newns) before calling firstRdmaDevice, which can t.Skip/t.Fatal and leak newns and leave RDMA mode changed; move the call to firstRdmaDevice(t) to the top of TestRdmaLinkSetNsFd before any calls to RdmaSystemSetNetnsMode or netns.New, or if you must keep the order, immediately register cleanup defers (defer newns.Close() and defer RdmaSystemSetNetnsMode(mode)) as soon as newns is created and mode changed so that any early t.Skip/t.Fatal still runs the cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@rdma_link_test.go`:
- Around line 117-160: The test TestRdmaLinkSetNsFd currently mutates global
state (calls RdmaSystemSetNetnsMode("exclusive") and creates newns) before
calling firstRdmaDevice, which can t.Skip/t.Fatal and leak newns and leave RDMA
mode changed; move the call to firstRdmaDevice(t) to the top of
TestRdmaLinkSetNsFd before any calls to RdmaSystemSetNetnsMode or netns.New, or
if you must keep the order, immediately register cleanup defers (defer
newns.Close() and defer RdmaSystemSetNetnsMode(mode)) as soon as newns is
created and mode changed so that any early t.Skip/t.Fatal still runs the
cleanup.
cff5c33 to
2a75592
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rdma_link_test.go (1)
144-160:⚠️ Potential issue | 🟠 Major
RdmaSystemSetNetnsMode(mode)cleanup skipped iffirstRdmaDeviceexits the test.
firstRdmaDevice(t)at line 144 may callt.Skip()(no devices) ort.Fatal()(list error). Both exit viaruntime.Goexit(), which runs deferred calls but skips non-deferred code. Since the mode revert at line 156 is in the normal control flow and not deferred, the system is left in"exclusive"RDMA netns mode iffirstRdmaDeviceaborts the test at this point.The fix is to defer the mode revert immediately after it is changed (line 126):
🛡️ Proposed fix
err = RdmaSystemSetNetnsMode("exclusive") if err != nil { t.Skipf("Failed to set RDMA netns mode to exclusive: %v", err) } + defer RdmaSystemSetNetnsMode(mode) basens, err := netns.Get() if err != nil { - RdmaSystemSetNetnsMode(mode) t.Fatal("Failed to get basens") } defer basens.Close() newns, err := netns.New() if err != nil { - RdmaSystemSetNetnsMode(mode) t.Fatal("Failed to create newns") } netns.Set(basens) link := firstRdmaDevice(t) t.Log("rdma link: ", link) err = RdmaLinkSetNsFd(link, uint32(newns)) if err != nil { newns.Close() - RdmaSystemSetNetnsMode(mode) t.Fatal(err) } newns.Close() - //Set the old mode back at start of the test - err = RdmaSystemSetNetnsMode(mode) - if err != nil { - t.Fatal(err) - } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rdma_link_test.go` around lines 144 - 160, The test currently sets the RDMA netns mode (via RdmaSystemSetNetnsMode) but later reverts it only in normal control flow, so if firstRdmaDevice(t) calls t.Skip()/t.Fatal() the revert is skipped; fix by deferring the revert immediately after changing the mode: right after the call that switches mode (the earlier RdmaSystemSetNetnsMode call where you capture the original mode in variable mode), add a defer that calls RdmaSystemSetNetnsMode(mode) (handle/log any error inside the defer rather than using t.Fatal), and similarly defer newns.Close() as soon as newns is created so cleanup always runs even if firstRdmaDevice or subsequent calls like RdmaLinkSetNsFd or RdmaLinkSetNsFd(link, uint32(newns)) cause the test to exit early.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@rdma_link_test.go`:
- Around line 144-160: The test currently sets the RDMA netns mode (via
RdmaSystemSetNetnsMode) but later reverts it only in normal control flow, so if
firstRdmaDevice(t) calls t.Skip()/t.Fatal() the revert is skipped; fix by
deferring the revert immediately after changing the mode: right after the call
that switches mode (the earlier RdmaSystemSetNetnsMode call where you capture
the original mode in variable mode), add a defer that calls
RdmaSystemSetNetnsMode(mode) (handle/log any error inside the defer rather than
using t.Fatal), and similarly defer newns.Close() as soon as newns is created so
cleanup always runs even if firstRdmaDevice or subsequent calls like
RdmaLinkSetNsFd or RdmaLinkSetNsFd(link, uint32(newns)) cause the test to exit
early.
|
lgtm. @vishvananda @aboch can you remind me why we have in this lib such names like I guess that we had some agreement on that several years ago, but tbh - i have already lost that memory... |
|
LGTM |
TestRdmaGetRdmaLink, TestRdmaSetRdmaLinkName, and TestRdmaLinkSetNsFd
hardcode the RDMA device name "foo". When ib_core is loaded but no
device named "foo" exists, these tests fail instead of skipping.
Replace the hardcoded name with a helper that lists available RDMA
devices and uses the first one, or skips the test if none are found.
Also skip TestRdmaLinkSetNsFd when switching to exclusive netns mode
fails with "device or resource busy", which happens when RDMA devices
are actively in use on the CI runner.
Signed-off-by: Ihar Hrachyshka ihrachyshka@nvidia.com
Assisted-By: Claude Opus 4.6 noreply@anthropic.com
Summary by CodeRabbit